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ABSTRACT 

Parallel  implementations  of  Fast  Fourier  Transforms 
(FFTs)  and  other  fast  transforms  are  represented  using 
factored,  partitioned  matrices.   The  factored  matrix 
description  of  a  distributed  FFT  is  introduced  using  a 
decimation-in-time  (DIT)  FFT  algorithm  suitable  for 
implementation  on  a  distributed  signal  processor. 

The  heart  of  the  matrix  representation  of  distributed 
fast  transforms  is  the  use  of  permutatons  of  an  NxN  identity- 
matrix  to  describe  the  required   inter-processor  data 
transfers  on  the  Butterfly  Network.   The  properties  of 
these  "transfer  matrices"  and  the  resulting  output  ordering 
are  discussed  in  detail.   The  f actored-matrix  representation 
is  then  used  to  show  that  the  Fast  Hartley  Transform  (FHT) 
and  the  Walsh-Hadamard  Transform  (WHT)  are  supported  by 
the  Butterfly  Network. 


TABLE  OF  CONTENTS 

I.   INTRODUCTION 7 

A.  BACKGROUND 8 

B.  RESEARCH  OBJECTIVES 10 

II.   FFT  MATRIX  FACTORIZATIONS 13 

A.  SHORTHAND  NOTATION 16 

B.  FACTORED-MATRIX  REPRESENTATION  OF 
SINGLE-PROCESSOR  FFT  (SPFFT) 20 

C.  FACTORED-MATRIX  REPRESENTATION  OF 
MULTI-PROCESSOR  FFT  (MPFFT) 30 

D.  PROPERTIES  OF  TRANSFER  MATRICES 38 

1.  Transfer  Matrix  Patterns 38 

2.  Algorithm  for  Generating 

Transfer  Matrices 44 

3.  Orthogonality  of  Transfer  Matrices 47 

4.  Pseudo-normal  Ordering 49 

5.  Selection  of  Transfer  Matrices 54 

E.  GENERALIZED  MULTI-PROCESSOR  FFTS 57 

1.  DIT  MPFFT  with  Bit-reversed  Input 59 

2.  DIT  MPFFT  with  PN  Input 68 

3.  Pseudo-bit-reversed  Ordering 73 

4.  DIF  MPFFT  Algorithms 76 

III.   OTHER  FAST  TRANSFORMS  ON  THE  BUTTERFLY  NETWORK-  80 

A.   FAST  HARTLEY  TRANSFORM  (FHT) 80 

1.  Single-Processor  FHT  (SPFHT) 

Algorithm 81 

2.  Multi-processor  FHT  (MPFHT)  Algorithm —  88 

5 


3.   Transfer  Patterns  and  Output  Ordering —  101 

B.   FAST  WALSH-HADAMARD  TRANSFORM  (WHT) 107 

1  .   Single-Processor  Walsh  Ordered 

Transform  (WTHT)w 107 

2.  Multi-Processor  Walsh  Ordered 

Transform  (MPWHT)w 114 

3.  Single-Processor  Hadamard  Ordered 
Transform  (WTHT)h 122 

4.  Multi-Processor  Hadamard  Ordered 
Transform  (MPWHT)h 124 

IV.   COMBINED  ALGORITHM  FOR  GENERATING  TRANSFER 

MATRICES  AND  OUTPUT  ORDERING 133 

V.   CONCLUSIONS 138 

LIST  OF  REFERENCES  — 140 

3IBLI0GRAPHY 141 

INITIAL  DISTRIBUTION  LIST 142 


I.   INTRODUCTION 

The  Fast  Fourier  Transform  (FFT)  has  become  popular 
because  it  computes  the  Discrete  Fourier  Transform  (DFT) 
much  faster  than  direct  evaluation  of  the  DFT  definition. 
While  evaluation  of  the  DFT  from  the  definition  requires 
N2  complex  multiplications,  the  FFT  algorithm  produces 
the  same  result  with  the  number  of  multiplications 
proportional  to  N  log2  N.   The  DFT  can  be  expressed  as  a 
vector-matrix  equation: 

X(k)  =  Wnk  x(n),     n,k-0,l N-l  (1.1) 

where    W    =    exp  [ -j  27T/N  ]  .  (1.2) 

The  Nxl  column  vectors  x(n)  and  X(k)  are  the  input  sequence 

and  its  DFT,  respectively.   The  record  length,  N,  is  assumed 

nk 
to  be  a  power  of  2.   The  NxN  matrix  W    represents  the  DFT 

operation  on  the  input  data  sequence  that  yields  the 
transformed  output  sequence.   This  thesis  uses  matrix 
factorization  of  the  DFT  matrix  to  describe  a  parallel 
multi-processor  implementation  of  the  FFT.   After  a  detailed 
investigation  into  the  properties  of  the  necessary  inter- 
processor  data  transfers,  the  Fast  Hartley  Transform  (FHT) 
and  the  Walsh-Hadamard  Transform  (WHT)  are  also  investigated 
for  compatibility  with  the  parallel-processing  network. 


A.   BACKGROUND 

The  Cooley-Tukey  FFT  algorithm  can  be  conveniently 
viewed  as  a  factorization  of  the  DFT  matrix  which  it 
implements.   Many  authors  have  used  this  approach  to 
illustrate  the  computational  savings  afforded  by  the 
FFT  [Refs.  1-5].   Each  factor  of  the  f ac tored-matr ix 
representation  corresponds  to  one  butterfly  stage  of  the 
FFT.   The  product  of  matrix  factors  is  equivalent  to 
evaluating  the  DFT  by  its  definition,  though  the  rows  or 
columns  of  the  FFT  matrix  product  are  reordered  in 
comparison  to  the  DFT  matrix.   The  row  (column)  reordering 
corresponds  to  permutation  of  the  output  (input)  sequence 
ordering,  respectively.   The  keys  to  the  speed  advantage 
of  the  FFT  are  that  the  matrix  factors  are  sparse 
[Ref.  3:  p.  328],  and  that  the  complex  weighting  factors 
are  periodic  [Ref.  4:  p.  358].   These  properties  permit 
the  calculation  of  the  DFT  coefficients  to  be  carried  out 
i teratively . 

A  more  recent  approach  to  enhance  signal  processing 
of  large  data  sets  is  parallel  processing.   This  concept 
was  used  to  develop  the  Distributed  Signal  Processor  (DISP) 
at  MIT  Lincoln  Laboratory  [Ref.  6].   The  DISP  uses  nodal 
processors  interconnected  by  a  Butterfly  Network  to  perform 
real-time  signal  processing.   The  input  data  set  is  first 
divided  equally  among  the  processors.   Then  all  processors 


simultaneously  perform  the  same  operations  on  different 
segments  of  data.   Finally,  all  processors  simultaneously 
exchange  partially-processed  data  through  the  Butterfly 
Network.   The  last  two  steps  are  repeated  until  the  task  is 
completed.   Since  many  signal  processing  applications  involve 
convolution  and  FFT  algorithms,  the  inter-processor  network 
was  selected  for  efficient  handling  of  these  algorithms. 
The  Butterfly  Network  allows  all  nodal  processors  to 
simultaneously  transfer  the  necessary  data  among  themselves 
without  conflict  [Ref.  6:  p.  1].   References  6-10  discuss 
in  detail  the  architecture  of  the  DISP  and  the  properties 
of  the  Butterfly  Network.   It  is  assumed  that  an  FFT 
algorithm  is  carried  out  in  parallel  by  P  active  processors 
to  compute  the  DFT  of  a  set  of  N  data  points.   It  is 
assumed  that  N  and  P  are  integer  powers  of  2,  N>2P 
[Ref.  7:  p. 3].   Therefore,  each  processor  contains 
M=N/P  points,  where  M  is  also  a  power  of  2. 

A  new  algorithm  for  the  computation  of  power  spectra 
is  the  Fast  Hartley  Transform  (FHT)  introduced  by 
Bracewell.   Requiring  only  real  arithmetic  computations, 
the  FHT  performs  even  faster  than  the  FFT  [Refs.  11-12]. 
The  Walsh-Hadamard  Transform  (WHT) ,  on  the  other  hand, 
uses  Walsh  functions  (square  waves)  as  its  basis 
functions.   Since  the  Walsh  functions  are  square  waves, 
they  take  on  only  two  values,  namely  +1  or  -1  [Ref.  13:  p.  225] 


Both  the  FHT  and  WHT  are  studied  for  parallel  implementation 
using  the  Butterfly  Network. 

B.   RESEARCH  OBJECTIVES 

The  primary  objective  of  this  research  is  to  investigate 
the  suitability  of  the  Butterfly  Network  for  multi-processor 
implementation  of  fast  transform  algorithms.   This  thesis 
develops  a  matrix  representation  of  the  multi-processor 
FFT  (MPFFT)  suitable  for  implementation   on  the  Distributed 
Signal  Processor.   The  MPFFT  algorithm  is  based  on  a  radix-2, 
in-place,  decimation-in-time  (BIT)  algorithm  with  input  in 
bit-reversed  order  [Ref.  6:  p.  28],   It  is  felt  that  the 
matrix  representation  provides  insight  into  the  output 
reordering  caused  by  the  required  transfers  on  the  Butterfly 
Network.   It  will  be  shown  that  the  transfer  patterns  can 
be  represented  by  permutations  of  an  NxN  identity  matrix. 
This  makes  the  "transfer  matrices"  easy  to  generate,  and 
leads  to  a  description  of  the  output  reordering  as 
a  function  of  the  number  of  active  processors.   After  the 
transfer  patterns  and  properties  of  the  transfer  matrices 
are  examined  in  detail,  the  FHT  and  WHT  algorithms  are 
investigated  for  potential  support  by  the  Butterfly  Network. 
To  permit  concentration  on  the  matrix  manipulations,  all 
normalizing  factors  of  (1/N)  are  neglected.   With  the  DFT 
definition  used  here,  a  factor  of  (1/N)  is  required  in  the 
definition  of  the  inverse  DFT.   Other  fast  transforms  also 
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require  normalizing  factors  in  either  the  forward  or 
inverse  transform  definitions.   This  factor,  in  either 
direction,  can  easily  be  inserted,  as  necessary,  as  a 
gain  multiplying  each  output  element.   In  the  MPFFT  matrix 
representations  of  Section  II,  the  column  indices  are 
repeatedly  referred  to  as  time  indices.   While  this 
terminology  is  correct  for  the  first  stage,  it  is  simply 
a  convenience  when  referring  to  the  partially-processed 
data  at  the  second  and  successive  computation  stages.   This 
terminology  is  used  because  it  is  felt  that  it  helps  rather 
than  hinders  the  discussion. 

The  multi-processor  FFT  is  presented  in  Section  II  as 
a  partitioning  of  the  single-processor  DIT  FFT.   The 
required  inter-procesor  data  transfers  are  described  by 
matrices,  and  their  properties  are  examined  in  detail.   An 
algorithm  is  presented  for  generating  the  transfer  matrices 
as  permutations  of  identity  matrices.   The  output  reordering 
caused  by  the  transfers  is  then  discussed,  and  an  algorithm 
is  developed  for  determining  output  ordering  as  a  function 
of  N  and  P.   Alternative  transfer  patterns  are  investigated, 
and  the  pattern  chosen  in  reference  6  is  justified.   The 
decimation-in-f requency  (DIF)  FFT  is  examined,  and  a 
general  matrix  equation  for  the  MPFFT  is  developed. 

Section  III  investigates  several  other  fast  transforms 
for  compatibility  with  the  Butterfly  Network.   The 
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relatively  new  Fast  Hartley  Transform  (FHT)  is  partitioned 
for  parallel  processing  using  the  Distributed  Signal 
Processor.   Finally,  the  Walsh  and  Hadamard  orderings  of 
the  Walsh-Hadamard  Transform  (WT)  are  studied.   Section 
IV  presents  a  unified  approach  to  the  algorithms  for 
transfer  matrix  generation  and  corresponding  output 
reordering,  and  also  shows  how  the  two  separate  algorithms 
from  Section  II  can  be  combined  into  a  single  algorithm. 


12 


II.   FFT  MATRIX  FACTORIZATIONS 

This  section  introduces  shorthand  notation  to  simplify 
the  matrix  manipulations  to  follow,  and  presents  a  repre- 
sentation of  a  distributed  FFT  algorithm  using  partitioned, 
factored  matrices.   The  matrix  version  of  the  distributed 
FFT  will  be  shown  to  involve  permutations  of  identity 
matrices  to  describe  the  necessary  data  transfers.   These 
"transfer  matrices"  possess  many  interesting  properties 
that  will  be  discussed  in  detail. 

Many  authors  have  used  matrices  to  describe  FFT 
algorithms  [Ref.  1:  pp.  148-149  and  2:  pp.  63-69].   In  fact, 
the  initial  derivation  of  the  FFT  depended  on  the  fact  that 
an  NxN  matrix,  none  of  whose  elements  is  zero,  can  be 
factored  into  sparse  matrices  with  many  zeros  [Ref.  3:  p.  297] 
A  Discrete  Fourier  Transform  defined  by 


N-l 
X(k)  =   I   x(n)  exp[-j(2ir/N)nk]   k  =  0 ,  1  ,  .  .  .  ,  N-l      (2.1) 
n  =  0 


can  be  represented  as  a  vector  matrix  equation 


E 
X(k)  =  W~  x(n),  (2.2) 


where  W  =  exp( -j  2  ir/N )  ,  and  E  is  a  matrix  of  exponents.   The 
input  sequence  is  the  vector  x(n),  given  by 
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(n)  =  [x(0)  x(l)  ...  x(N-l)] 


(2.3) 


and  the  transformed  output  sequence  X(k)  is 


X(k)  =  [X(0)  X(l)  ...  X(N-1)]\ 


(2.4) 


The  vectors  x(n)  and  X(k)  may  represent  sequences  in  either 

normal  order,  or  in  digit-reversed  order.   In  several  forms  of 

the  radix-2  FFT  algorithms  a  normal-ordered  input 

data  sequence  produces  output  in  bit-reversed  order. 

Similarly,  a  bit-reversed  input  data  sequence  yields  a 

transformed  output  sequence  in  normal  order. 

E 
The  matrix  W"  is  a  two-dimensional  array  representing 

the  transform  operation.   It  has  row  indices  k  =  0 , 1 ,  .  .  .  , N-l , 

and  column  indices  n=0 , 1 , . . . , N-l .   Since  the  "twiddle  factor" 

W  =  exp(-j2TT/N)  is  common  to  all  the  elements  of  the  DFT 

E 
matrix  W ~ ,  a  matrix  of  exponents  E  can  be  written,  whose 

elements  are  functions  of  k  and  n.   For  example,  for  N=8 , 

(2.1)  gives 


X(0)  =  [W°  W°  W°  W°  W°  W°  W°  W°]  x(n) 
X(l)  =  [W°  W1  W2  W3  W4  W5  W6  W7]  x(n) 
X(2)  =  [W°  W2  W4  W6  W°  W2  W4  W6]  x(n) 


(2.5) 


X(7)  =  [W°  W7  W6  W5  W4  W3  W2  W1]  x(n) 
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In  matrix  form 


X(k)=W  x(n)= 


and  E  =   k 


0 
1 
2 
3 
4 
5 
6 
7 


X(0) 
X(l) 
X(2) 
X(3) 
X(4) 
X(5) 
X(6) 
X(7) 

0 
0 
0 
0 

0 
0 
0 
0 

0 


w°  w°  w°  w°  w°  w°  w°  w° 

w°  w1  w2  w3  w4  w5  w6  w7 

0   2   4   6  0  4   6 

w  w  w  w  w  W*"  w  w 

w°  w3  w6  w1  w4  w7  w2  w5 

04040404 

W   W   W   W  W  W  WJ  W 

w°  w5  w2  w7  w4  w1  w6  w3 

w°  w6  w4  w2  w°  w6  w4  w2 

w°  w7  w6  w5  w4  w3  w2  w1 


4    5 
0    0 


x(0) 
x(l) 
x(2) 
x(3) 
x(4) 
x(5) 
x(6) 
x(7) 


(2.6) 


(2.7) 


i 


Note  that  each  element  in  the  matrix  of  exponents,  E,  is  the 
product  of  the  row  label,  k ,  and  the  column  label,  n,  computed 

modulo  N  (i.e.,  k'n  mod  N  =  remainder  of  k-n/N).   The  compu- 

nk 
tation  of  the  kn  products  mod  N  is  due  to  the  fact  that  W 

is  periodic  with  period  N,  i.e., 
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w(n+raN)(k+rN)  =  ynk 


m,l  =  0,  +  l  ,  . . 


(2.7a) 


nk 
The  periodicity  of  W    is  one  of  the  keys  to  the  FFT 

[Ref.  4:  p.  358].   In  general,  for  an  N-point  DFT,  the  E 

matrix  of  exponents  is  NxN,  with  entries: 


E  = 


0 

1 

2 

0 

0 

0 

0 

1 

0 

1 

2 

2 

0 

2 

4 

N-2 
N-l 


0  N-2  N-4 
0  N-l  N-2 


N-2  N-l 

0  0 

N-2  N-l 

N-4  N-2 


(2.8) 


A.   SHORTHAND  NOTATION 

One  way  to  derive  FFT  algorithms  is  to  factor  the  DFT 
E 
matrix  W   into  a  product  of  matrices.   This  method  is  also 

well-suited  to  the  description  of  distributed  FFT  algorithms, 

and  will  be  the  method  used  here. 

Elliott  and  Rao  have  developed  a  shorthand  notation  that 

makes  the  matrix  factorization  easy  to  follow  [Ref.  2: 

pp.  47-49].   To  illustrate,  consider  the  4-point  DFT  defined 

by : 
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X(k)  =  W~x(n)     k=0,l,2,3 

n=0,l ,2,3 


(2.9) 


In  matrix  form , 

X(0) 

X(l) 

X(2) 

X(3) 


~w° 

w° 

w° 

w°~] 

^(0) 

w° 

w1 

9 

w3 

x(l) 

w° 

w2 

w° 

w2 

* 

x(2) 

w° 

w3 

w2 

w1 

x(3) 

— 

(2.10) 


This  can  be  manipulated  into  a  decomposition-in-time  (DIT) 

E 
FFT  by  reordering  the  columns  of  W   as  follows: 


X(0) 
X(l) 
X(2) 
X(3) 


w° 

w° 

w° 

w° 

~x(oy 

w° 

w2 

w1 

w3 

x(2) 

w° 

w° 

w2 

w2 

♦ 

x(l) 

w° 

w2 

w3 

w1 

J°l 

1 

1 

1 

1 

x(0) 

1 

-1 

-j 

j 

x(2) 

1 

1 

-1 

i 

-1 

x(D 

1 

-1 

j 

-j 

x(3) 

(2.11) 


where  W  =  exp(-j2  7r/4)  =  -j  ,  and  the  n  index  values  of  the 

input  sequence  have  been  correspondingly  reordered  as  well 

E 
The  matrix  W ~  can  be  factored  to  yield 


W"  =  W 


?2 


W 


(2.12) 
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-1       ^2 
where  W    and  W    will  represent  the  first  and  second 

stages,  respectively,  of  butterfly  computations  in  the 

DIT  FFT.   Let 


h    - 


and  E,  = 


0 

0 

• 

• 

0 

2 

• 

• 

• 

• 

0 

0 

0 

2 

(2.13) 


Then  , 


W 


w 

0 

w 

0 

0 

w° 

0 

w 

w° 

0 

w2 

0 

0 

w° 

0 

w 

1 

0 

1 

0 

0 

1 

0 

-j 

1 

0 

-1 

0 

0 

1 

0 

j 

(2.14) 


and 


E 


W 


w 
v« 

0 

0 


w 
w 

0 

0 


0 
0 

w 


0 


0 
0 

w( 
w: 


1 

1 

0 

0 

1 

-1 

0 

0 

0 

0 

1 

1 

0 

0 

1 

-1 

where  the  shorthand  notation  of  a  dot  (no  entry)  in  the 

ET  matrices  of  exponents  corresponds  to  an  element  value 

E 

~L 
of  zero  in  the  W    matrices.   Performing  the  matrix 

multiplication  of  (2.12)  yields 


18 


w° 

0 

w° 

0 

w° 

w° 

0 

0 

0 

w° 

0 

w1 

w° 

w2 

0 

0 

w?   - 

i2  §i 

=  w  •  w  = 

w° 

0 

9 

0 

. 

0 

0 

w° 

w° 

0 

w° 

0 

w-3 

_0 

0 

w° 

n 

T70  +  0       „0  +  0      T70  +  0      Y70  +  0 
W  w  w  w 


I70  +  0      ..0  +  2      „l+0 

Www 


.1  +  2 


„0+0  V70  +  0  V72  +  0  T72  +  0 

w  w  w  w 

t70  +  0  T70  +  2  T73  +  0  T73  +  2 

W  W  W  w 


w°     w°     w°     w° 


w°     w2 
w°     w° 


1  3 


w       w 


0^31 


(2.15) 
Note  Chat  W~  is  factored  such  that  only  one  nonzero  entry 

h 

per  row  of  W    is  multiplied  by  only  one  nonzero  entry  of 

h 

any  given  column  of  W   .   This  result  is  true  for  all 

factored  FFT  matrices  [Ref.  2:  p.  48].   Since  W  is  a 
common  factor,  the  exponents  simply  add  (modulo  N)  when  the 

o      r\      pa  -4-  r\ 

matrix  multiplication  is  performed,  i.e.,  W  • W  =W 
Therefore,  an  easier  way  to  carry  out  the  matrix  multiplica- 
tion of  (2.12)  is  by  the  addition  of  appropriate  exponents 
of  (2.13).   Let  the  shorthand  notation 


n     J—. ,-     i— .  -i 


(2.16) 
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mean  the  matrix  of  exponents  derived  from  the  sum  of 
appropriate  elements  of  the  factored  matrices  of  exponents 
of  (2.13),  where  the  elements  to  be  added  are  defined  by  the 
row-times-column  rule  of  matrix  multiplication.    Thus, 

o       .       o      T 

0    .    1 

0.2. 

0    .    3 


?"?2*?1 


0+0  0+0  0+0  0+0 
0+0  0+2  1+0  1+2 
0+0  0+0  2+0  2+0 
0+0  0+2  3+0  3+2 


0 

0 

- 

• 

0 

2 

• 

• 

• 

• 

0 

0 

_2 

• 

0 

i 

0 

0 

0 

0 

0 

2 

1 

3 

0 

0 

2 

2 

0 

2 

3 

1 

(2.17) 


and  the  result  is  the  same  as  the  exponents  of  (2.15).   It  is 
usually  much  simpler  to  work  with  the  matrices  of  exponents, 
than  to  actually  perform  the  matrix  multiplication.   Since 
all  the  necessary  information  is  contained  in  the  matrices 
of  exponents,  this  section  will  deal  mainly  with  this 
representation . 


B.   FACTORED-MATRIX  REPRESENTATION  OF  SINGLE-PROCESSOR  FFT 
(SPFFT) 

E 
Factoring  the  DFT  matrix  W~  of  (2.2)  and  reordering  the 

rows  and/or  columns  leads  to  various  FFT  algorithms.   The 


20 


factored  FFT  matrices  represent  individual  stages  of  the 
particular  algorithms.   The  concept  of  matrix  factorization 
will  be  developed  in  this  section  to  describe  a 
single-processor,  radix-2,  DIT  FFT. 

To  compute  an  N-point  FFT  (for  N  a  power  of  2),  there 
are  log~N  "computation"  matrix  factors,  each  having  2 
nonzero  elements  per  row.   Each  computation  matrix  represents 
the  butterfly  computations  carried  out  at  a  given  stage. 
The  number  of  nonzero  elements  per  row  depends  on  the  radix 
base  of  the  algorithm.   For  example,  a  radix-3  FFT  compu- 
tation matrix  contains  3  nonzero  elements  per  row,  radix-4 
matrices  have  4  nonzero  elements  per  row,  etc.   A  flow 
diagram  for  an  8-point  radix-2  DIT  FFT  is  shown  in  Figure  2.1 
The  vector-matrix  equation  describing  a  single-processor 
implementation  of  the  8-point  DIT  FFT  is: 


k  E  ~       E  ry  E  -■ 

X(k)  =  W~x(n)  =  W    W    W    x(n)    k=0 , 1 , . . . , 8 

n=0, 1 , ... ,8 


(2.18) 


where  3  butterfly  (computation)  stages  are  necessary  to 

-1    ~2 
accomplish  the  radix-2  FFT.   The  matrix  factors  W   ,  W   , 

h 

and  W    represent  the  computations  performed  at  stages 

1,  2,  and  3,  respectively.   Let  X  (n)  represent  the  output 

of  the  first  stage  of  butterflies  of  Figure  2.1.   Then, 
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X^O)  =  X°(0)  +  W°X°(1) 

XX(1)  =  X°(0)  -  W°X°(1) 

X*(2)  =  X°(2)  +  W°X°(3) 

X1(3)  =  X°(2)  -  W°X°(3) 

Xl(4)  =  X°(4)  +  W°X°(5) 


X:(5) 


Xx(6)  = 

1 


X°(4)  -  W°X°(5) 
X°(6)  +  W°X°(7) 


7)  =  X°(6)  -  W°X°(7) 


W°X°(0 
W°X°(0 
W°X°(2 
W°X°(2 

W°X°(4 
W°X°(4 
W°X°(6 
W°X°(6 


+  W°X°(1) 

4  0 
+  W  XU(1) 

+  W°X°(3) 

4  0 
+  W  XU(3) 

+  W°X°(5) 

+  W  XU(5) 

+  W°X°(7) 

4  0 
+  W  XU(7) 


(2.19) 


where  W4  =  exp  [  -j  (  2tt  /r  )  (  4  )  =-1  ,  W°=l  ,  and  X  (n)=x(n)  is  the 
bit-reversed  input  sequence,  i.e.,  the  input  to  the  first 
butterfly  stage.   In  matrix  form, 


rw°  w° 


1         ?1  0 
X^n)  =  W~iXU(n) 


w°  w4 


w°  w° 

w°  w4 


w 


w 


w 


w 


w( 


w 


0 


w 


w 


X°(n) 


(2.20) 


23 


Similarly,  the  output  of  the  second  stage  of  butterfly 
computations  is: 


X2(0)  =  X^O)  +  W°X1(2)  =  W°X1(0 

X2(l)  =  X1(l.)  +  W2XX(3)  =  W°X1(1 

X2(2)  =  XX(0)  -  W°X1(2)  =  W°X1(0 

2         1         2  1         0  1 

XZ(3)  =  XX(1)  -  WZXX(3)  =  WUX  (1 

X2(4)  =  XX(4)  +  W°X1(6)  =  W°X1(4 

X2(5)  =  XX(5)  +  W2X:(7)  =  W°X1(5 

X2(6)  =  XX(4)  -  W°X1(6)  =  W°X1(4 

X2(7)  =  XX(5)  -  W2X1(7)  =  W°X1(5 


+  W°X1(2) 
+  W2X1(3) 


+  w4-1 


a  (2) 


+  W6X1(3) 
+  W°Ai(6) 
+  W2XX(7) 
+  W4XX(6) 
+  W6X1(7) 


(2.21) 


since  W   =  -W  ,  and  W   =  -W  .   In  matrix  form, 


X2(n)  =  W2X1(n)  = 


0 


w 

0 


w 

0 

w1 

0 


w 

0 


w 

0 

w 

0 

0 

w° 

0 

w 

w° 

0 

w4 

0 

0 

w° 

0 

w 

X1(n) 


(2.22) 
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The  output  of  the  last  stage  is: 


3        2      0  2        0  2 

XJ(0)  =  XZ(0)+WUX  (4)  =  WUX  (0 

3         2      12        0  2 

X°(l)  =  X^(1)+W1X  (5)  =  WUX  (1 

3         2       2  2         0  2 

XJ(2)  =  XZ(2)+WZXZ(6)  =  WUXZ(2 

o  o        o  9  0  2 

XJ(3)  =  X"(3)+WJX~(7)  =  WUX~(3 

X3(4)  =  X2(0)-W°X2(4)  =  W°X2(0 

X3(5)  =  X2(1)-W1X2(5)  =  W°X2(1 

X3(6)  =  X2(2)-W2X2(6)  =  W°X2(2 

3         2       3  2         0  2 

XJ(7)  =  XZ(3)-WJXZ(7)  =  WUXZ(3 


0  2 
+  WUXZ(4) 

+  W!X2(5) 

+  W2X2(6) 

3  ? 
+  WJXZ(7) 

+  W4X2(4) 

+  W5X2(5) 

+  W6X2(6) 

+  W7X2(7) 


(2.23) 


since  W4  =  -W° ,  W5  =  -W1 ,  W6  =  -W2 ,  and  W7  =  -W3 . 


Again,  in  matrix  form, 


X(k)=X3(N)=W~3X2(N)= 


W 
0 
0 
0 

w( 


0    0 
0    0 


0 

0 

w( 

0 
0 
0 

w( 

0 


0 
0 
0 

w( 

0 
0 

0 

w 


w 
0 

0 


w 

0 


0    0 

4 


0 

0 


0 


0    0 


0 
0 

w 

0 
0 
0 

w' 

0 


0 
0 
0 

w 

0 
0 
0 

w 


X2(n) 


(2.24) 
Combining  (2.20),  (2.22),  and  (2.24)  gives  the  overall  matrix 
representation  for  the  8-point  DIT  FFT: 
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§o  o  ?<5  ?o  i  ?Q  Wo  E,   n 

X(k)  =  W  JXZ(n)  =  W  JW  ZXi(n)  =  W  JW  ZW   X  (n)  (2.25) 

where  X(k)  is  the  normal-ordered  output  sequence,  and  X  (n) 
is  the  bit-reversed  input  sequence.   The  factored  matrices 
of  exponents  corresponding  to  the  stages  of  the  DIT  FFT  of 
Figure  2.1  are: 


E,  = 


En   = 


0 

4 

0 

4 

0 

4 

0 

4 

0 

0 
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• 

• 

• 

1 

0 

4 

• 

• 

• 

• 

2 

0 

0 

• 

• 

3 

0 

4 

• 

• 

4 

• 

• 

0 

0 

5 

• 

• 

0 

4 

6 

• 

• 

• 

• 

0 

0 

7 

. 

. 

. 

• 

0 

4 

0    2 
0 

0 

4 

0 


0    0    2    2 


0.0. 

0    .    2 
0.4. 

0    .    6 


(2.26) 
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n 


k 

0 

0 

0 

1 

• 

2 

• 

3 

• 

3 

4 

0 

5 

• 

6 

• 

7 

0    0    0 


1   1 

0 

1 


1       1 


0 


7 


(2.26) 


where  once  again  a  dot  or  no  entrv  in  the  exponent  matrices 
corresponds  to  an  entry  of  zero  in  the  W    matrices. 
The  column  indices  are  discussed  in  a  subsequent  paragraph. 
Performing  the  matrix  mult iDlication  using  the  shorthand 
notation  described  earlier  gives  the  result  shown  in 
Figure  2.2.   Notice  that  the  addition  of  exponents  in  (2.27) 
is  carried  out  mod  N  (N=8).   Also  observe  the  sparseness  of 
the  matrix  factors  E,  ,  E~  ,  and  E~.   Each  matrix  factor 

contains  only  2N  entries,  and  only  the  final  overall  matrix 

2 

of  exponents,  E,  contains  numerical  entries  in  all  NxN=N^ 

elements  . 

A  few  comments  about  the  nature  of  the  DIT  FFT  algorithm 
are  in  order.   When  an  N-point  DFT  is  divided  into  two  DFT's 
of  one-half  the  original  size,  the  time  sequence  separation 
of  the  input  of  either  of  the  smaller  DFTs  is  increased  by  2 
This  corresponds  to  dropping  (decimating)  alternating  inputs 


27 


h 


VBi 


K 

JO   0.    2 

2      0 

0      2 

2_ 

tt 

1  0    4      0    4      0 

4      0 

4_ 

&° 

4     2     6     0    4     2    8 

0 

o    -i  0 

•  i 

1 

0 

0   oj          | 

1 
| 

0 

0 

o'o  oi 

1 

2 

1 
•    0|     • 

0     ."J"   4 

1 

2| 

r- 
■  i 

1 

t 
1 
I 

1 
2 

0     41             1 

f f— 

1   0     0| 

1 

— |— 
1 

1 
2 

0 
0 

41    2    6|             1 
©I   4   4[ 

3 

•    01     • 

61 

1 

3 

.  -i°  ii- 

1 

3 

0 

4  !  6    2  1             1 

4 

1 

1  o 

•!  o 

4 

1             1    0 

01 

4 

I             |   0   0  |   0   0 

5 

1 

1_ 

1     . 
J. 

0  1    • 

2 

5 

!      !o 

«i 

5 

|   0  4  |  2    6 

5 

1 

1   0 

•  1    4 

6 

1          T     " 
1            | 

!° 

0 

6 

1  0   0  j    4   4 

7 

1 
1 

1 
1     ' 

°!  • 

S 

7 

1           1 

1  o 

4 

4 

04      62 

E2*E1 


i 

1 0     o    0    0     1 

111 

£ 

0     4     2 

6    0     4     2    6 

k?0    4      2     6      15     3     7 

0 

0   • !    •    •  |  o 

1 
1 

0 

0  oj   0 

0'             ! 

1             | 

0 

0   01    0   0|    0   01    0   0 

1 

2 

;--°j_l_J 

.     .  1    0    ■  |    ■ 

11    •    • 

•   1    2    • 

1 
2 

0   4|2 
0   0~j   4 

6|             1 

■*i — r— 

1 
2 

0    4|     2    6J     1    5|     3    7 
0   oj"  4   4  j    2    2  f  6    6 

3 

_ .  .  i  .  oj  . 

.1.3 

3 

0   41     6 

21 

3 

0   41    62|    371    15 

4 

0     -j     •     -1    4 

.  I     .     . 

4 

1 

1    0   0  1   0   0 

4 

0   01    0   0|    4   41    44 

5 

•    01     •     •        • 

r-  • 

5,     ■     ■ 

5 

1 

1 

j    04      2    6 

5 

0    4j    2   5l    5    l|    7   3 

6 

•  T6  ■  {*  •' 

'I6    ' 

6 

I 

0   0~[    4   4 

6 

00'    441    561    22 

7 

-     .  1     .     0|     • 

1                , 

•1-7 

7 

1 

|     0    41    6    2 

7 

0   4J    6    2]    7   3J    5    1 

(2.27) 


Figure  2.2   8-Point/ 1 -Processor  DIT  FFT 
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of  the  original  DFT ;  hence  the  name  decimation  in  time.   The 
idea  behind  the  FFT  is  thus  to  break  the  original  N-point 
sequence  into  two  N/2-point  sequences,  the  DFT's  of  which 
can  be  combined  to  give  the  DFT  of  the  original  N-point 
sequence.   Next,  each  N/2-point  sequence  is  decomposed 
into  two  N/4-point  sequences,  and  so  on,  until  2-point 
sequences  result.   The  DFT  of  a  one-point  sequence  is  simply 
the  sample  itself  [Ref.  5:  p.  49].   The  input  sequence 
indices  for  an  N-point  DFT  are  aeparated  by  1  unit,  based 
on  a  normalized  analysis  period,  (record  length),  P  =  1 
second.   This  yields  transformed  output  sequence  frequencies 
separated  by  1  Hz  [Ref.  2:  p.  60].   Thus,  the  column  labels, 
n,  of  the  factored  matrices  of  (2.26)  are  separated  by  larger 
increments  of  time  going  from  right-to-left  in  Figure  2.1. 
That  is,  the  last  stage,  E~,  has  column  labels  separated  by 
1  unit,  E~  has  indices  separated  by  N/4=2units,  and  the  E1 
column  labels  are  separated  by  N/2=4  units.   Performing  the 
matrix  multiplications  of  (2.27)  results  in  a  time  sequence 
mixing  operation,  in  which  the  time  labels  in  the  matrices 
of  exponents  add  .   The  matrices  E~  and  E„  have  sample  periods 
0  and  1  or  0  and  2  units,  respectively  (based  on  an  analysis 
period  of  P  =  1  second).   These  periods  mix  so  that  E~  *  E~ 
has  periods  0,  2,  1,  and  3  seconds.   Likewise,  the  time 
sequence  periods  0  and  4  seconds  in  E,  combine  to  yield 
periods  0,  4,  2,  6,  1,  5,  3,  and  7  seconds  in  the  overall 
matrix,  E.   The  overall  DIT  FFT  matrix  of  exponents,  E  in 
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(2.27)  is  the  same  as  the  DFT  matrix  of  exponents  (2.7), 
except  for  a  reordering  of  the  columns.   Therefore,  the  DIT 
FFT  matrix  can  be  viewed  as  a  DFT  matrix  with  reordered 
columns . 

C.   FACTORED-MATRIX  REPRESENTATION  OF  MULTI-PROCESSOR 
FFT  (MPFFT) 

In  this  section  the  concept  of  matrix  factorization  will 
be  extended  to  the  description  of  distributed  FFT  algorithms, 
with  consideration  of  the  multi-processor  FFT  implemented  on 
the  Butterfly  Network  at  M.I.T.  Lincoln  Laboratory  [Ref.  6]. 

Distributed  FFT  algorithms  can  also  be  described  in  terms 
of  row  or  column  permutations  of  an  overall  DFT  matrix, 
with  the  inclusion  of  matrices  describing  the  transfer(s) 
of  data  among  the  several  processors.   Figure  2.1  can  be 
partitioned  to  describe  an  8-point  FFT  implemented  using  4 
processors.   The  result  of  this  partitioning  is  shown  in 
Figure  2.3. 

Since  the  DIT  FFT  algorithm  combines  data  separated  by 
1  ,  2  ,  4  ,  .  .  .  , N/2  memory  locations  progressing  through  the 
butterfly  stages  of  the  algorithm,  the  required  separation 
will  eventually  exceed  the  size  (M=N/P)  of  the  data  block 
held  within  each  processor.   (N=number  of  input  data  points, 
P-number  of  active  processors;  both  assumed  to  be  a  power  of 
2.)   At  this  point,  and  for  each  subsequent  computation  stage, 
an  inter-processor  data  transfer  must  take  place.   Kirk  and 
Verly  introduced  the  terminology  "pre-transfer"  and 
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"post-transfer"  butterflies  to  describe  these  multi-processor 
operations  [Ref.  7:  pp.  5-6].   The  butterfly  computations 
that  are  accomplished  prior  to  any  data  transfers  are  called 
pre-transfer  butterflies.   Those  which  are  carried  out  after 
data  transfers  are  called  post- transfer  butterflies. 

Partitioning  the  N  data  points  equally  among  P  processors 
leads  to  L";:"  =  logil  (M=N/P)  stages  of  pre-transfer  butterflies, 
and  log^P  post-transfer  butterfly  stages.   Since  each  post- 
transfer  butterfly  stage  requires  one  data  transfer,  there 
are  also  log?P  transfer  stages.   Note  that  the  number  of 
transfers  depends  only  on  the  number  of  processors,   and  not 
on  the  input  data  record  length. 

The  pattern  of  transfers  used  in  reference  7  consists 

of  an  exchange  of  the  lower  half  of  the  data  in  one  processor 

with  thee  upper  half  of  the  data  in  another.   (In  this  context, 

lower  and  upper  refer  to  local  memory  addresses,  and  not  to 

relative  memory  positions.)   The  destination  processor  in  a 

given  transfer  is  determined  by  the. cube,  function,  which 

complements  the  it_h  bit  of  the  source  processor's  binary 

address  [Ref.  8:  p.  224].   For  example,  consider  processor 

Pp.  in  a  four  processor  (P  =  4)  network.   The  processor's  binary 

address  is  00.   The  cuben  function  is  01,  and  the  cube. 

function  is  10.   The  destination  processors  in  the  first 

data  transfer  are  determined  using  the  cube„  function,  those 

in  the  second  transfer  by  the  cube,  function.   In  general, 

the  destination  processors  in  the  it_h  transfer  are  determined 

by  the  cube,.  ..  function   [Ref.  9:  p.  74]. 
( l-l  ) 
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To  determine  whether  a  processor  transfers  the  upper  or 
lower  half  of  its  data,  processors  are  designated  as 
"upper-half"  or  "lower-half"  processors  at  each  transfer 
stage.   At  the  itji  transfer  stage,  the  (i-l)s_t  bit  of  the 
processor's  binary  address  is  inspected.   If  this  bit  is 
zero,  the  processor  is  designated  as  an  upper-half  processor, 
and  it  transfers  the  upper  half  of  its  data  at  this  stage. 
A  processor  whose  (i-l)s_t  bit  is  one  transfers  the  lower 
half  of  its  data  at  the  i_th  transfer  stage  [Ref.  9:  p.  74]. 
For  example,  in  Figure  2.3,  at  the  first  transfer  stage  (i=l), 
the  Qth  bits  of  processors  P~  and  P«  are  zero.   Therefore, 
P„  and  P~  are  upper-half  processors;  similarly,  P,  and  P~ 
are  lower-half  processors  at  this  stage. 

The  transfer  pattern  chosen  in  references  7  and  9 
is  not  unique,  but  leads  to  a  regular  structure  that  can  be 
described  analytically.   When  these  data  transfers  are 
represented  using  matrices,  the  "transfer  matrices"  possess 
many  interesting  properties.   As  will  be  shown  later,  the 
data  transfers  can  be  described  as  permutations  of  an  NxN 
identity  matrix.   Although  single-processor  algorithms 
generate  output  sequences  in  normal  order  when  the  input 
data  is  in  bit-reversed  order,  the  distributed-processor 
algorithm  produces  data  in  what  reference  7  refers  to  as 
"pseudo-normal"  order. 
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The  distributed  implementation  of  an  8-point  DIT  FFT  with 
bit-reversed  input  data  shown  in  Figure  2.3  can  be  described 
by  the  vector-matrix  equation: 


?3   I?   ?2   ~1   ?1 
X(k)=WWWWWx(n) 


(2.28) 


k=0, 1, . . . ,8 

n  =  0, 1 ,  ...  ,8 
where  E, ,  E„  and  E~  are  matrices  of  exponents  describing 
the  butterfly  computations  at  the  first,  second,  and  third 
stages,  respectively.   Matrices  of  this  type  are  referred 
to  as  "computation  matrices".   T,  and  T~  are  matrices 
describing  the  first  and  second  data  transfer  stages,  and 
are  called  "transfer  matrices".   To  illustrate  the  generation 
of  the  transfer  matrices,  consider  the  first  data  transfer 
in  Figure  2.3: 
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where  the  superscripts  indicate  stage  number  (prime  denoting 
a  transfer  stage),  the  subscripts  denote  the  processor  number, 
and  the  argument  specifies  the  multi-processor  memory 
location.   In  shorthand  notation, 


Ti  = 


0 

• 

0 

• 

• 

• 

0 

0 

0 

0 

• 

• 

0 

• 

0 

(2.30) 


sxnce  w   =  1  . 

Using  the  shorthand  notation  to  carry  out  the  matrix 
multiplication  of  (2.28)  gives  the  result  shown  in  Figure  2.4 
Note  the  ordering  of  the  output  sequence  X  (  k )  ,  indicated  by 
the  row  label,  k.   One-half  of  the  data  in  a  given  processor 
can  be  read  out  in  natural  order   before  it  is  necessary  to 
move  on  to  the  next  processor  to  read  the  next  element  of 
the  output  sequence.   Two  passes  through  the  processor 
network  are  thus  necessary  to  read  out  the  entire  output 
sequence.   This  is  always  true,  regardless  of  the  length 
of  the  output  sequence  or  number  of  processors  employed. 
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Figure  2.4   8-Point /4-Processor  DIT  FFT(BR  Input) 
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The  partitioned  computation  matrices  ET  are  identical  to 
the  single-processor  case  for  the  stages  prior  to  any  data 
transfers  (the  pre-transfer  butterflies).   For  the  post- 
transfer  butterfly  stages,  the  rows  of  the  computation 
matrices  have  been  permuted  vis-a-vis  their  single-processor 
counterparts.   (For  example,  compare  the  row  ordering  of 

E0  of  Figure  2.4  with  En  of  Figure  2.2.)   Notice  that  the 

*-  z  "  Z 

exponents  within  any  given  row  are  identical,  though  the 
ordering  of  the  rows  is  changed.   This  reordering  is  caused 
by  the  data  transfers,  and  will  be  examined  in  detail  in  the 
next  section.   The  entire  purpose  for  data  transfers  is  to 
get  data  points  that  need  to  be  combined  in  subsequent 
stage  butterflies  into  the  same  processor.   This  is  necessary 
when  the  memory  separation  between  points  to  be  combined  in 
a  butterfly  exceeds  that  of  the  data  stored  in  a  given 
processor.   The  most  economical  way  to  affect  the  necessary 
transfers  is  to  transfer  only  one-half  the  data  at  any  given 
transfer  stage.   There  are  several  other  constraints  on 
potential  transfer  patterns;  these  will  also  be  discussed 
in  the  next  section.   Finally,  observe  once  again  that  the 
exponent  entries  in  any  computation  matrix,  ET  ,  can  be 

'"Li 

determined  by  the  residue  of  kn  modulo  N,  where  k  is  the  row 
label,  and  n  is  the  column  label. 
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D.   PROPERTIES  OF  TRANSFER  MATRICES 

At  the  heart  of  the  multi-processor  FFT  (MPFFT) 
algorithm  are  the  inter-processor  data  transfers  that  must 
occur.   There  are  log^P  inter-processor  transfers  required 
to  compute  an  FFT  implemented  using  P  active  processors. 
When  the  transfers  are  represented  by  matrices,  the  "transfer 
matrices"  possess  many  interesting  properties.   These 
properties  will  be  examined  in  detail  in  this  section.   The 
required  data  transfers  permute  the  ordering  of  the  output 
sequence,  hence  the  rows  of  the  FFT  matrix.   When  the  FFT 
is  implemented  on  a  single  processor,  the  output  is  produced 
in  normal  order  when  the  input  data  is  in  bit-reversed  (BR) 
order.   The  bit-reversal  is  a  consequence  of  the  decimation- 
in-time  FFT  algorithm  for  computing  DFTs .   The  multi- 
processor FFT  algorithm  produces  "pseudo-normal"  (PN) 
ordered  output  when  the  input  sequence  is  in  bit-reversed 
order.   Thus,  PN  ordering  is  a  characteristic  of  MPFFT 
implementations.  This  section  will  show  how  and  why  the  PN 
ordering  comes  about . 

1  .   Transfer  Matrix  Patterns 

An  eight-port  butterfly  network  to  interconnect 
eight  processors  is  shown  in  Figure  2.5.   Reference  7 
describes  in  detail  the  topology  and  properties  of  the 
Buttterfly  Network  (BN).   The  transfers  that  take  place  on 
the  BN  are  included  in  the  vector-matrix  FFT  representation 
by  additional  matrix  factors  to  be  called  "transfer  matrices" 
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The  transfer  matrices  represent  the  input-output  connectivity 
of  the  BN ,  where  each  switch  is  regarded  as  a  two-port 
network.   The  inputs  to  the  BN  come  from  output  ports  of  the 
processors  P  ~  ,  P.,,  ...,  Pp_,  ;  the  network's  outputs  in  turn 
are  connected  to  the  input  ports  of  these  same  processors 
[Ref.  10:  p.  1].   It  is  convenient  to  view  the  transfer 
matrices  as  permutations  of  an  NxN  identity  matrix.   An 
identity  transfer  matrix  results  in  effectively  no  transfer, 
leaving  all  the  data  in  unchanged  locations.   When 
representing  the  actual  inter-processor  data  transfers  , 
elements  move  off  the  main  diagonal  in  a  very  regular  pattern. 
One-half  of  the  elements  remain  on  the  main  diagonal.   (Recall 
that  only  half  the  data  is  transferred  at  any  given  transfer 
stage.)   The  remaining  elements  move  right  (or  equivalent ly , 
down)  off  the  main  diagonal  or  left  (up)  off  the  main  diagonal 
The  elements  move  so  as  to  maintain  symmetry  about  the  main 
diagonal.   This  is  a  convenient  property,  the  consequences 
of  which  will  be  discussed  later. 

The  only  parameters  that  must  be  known  to  determine  the 
exact  form  of  a  particular  transfer  stage  matrix  are  the 
record  length,  N,  the  number  of  active  processors,  P,  and  the 
transfer  stage  number,  R.   The  number  of  active  processors 
sets  the  number  of  transfer  stages,  log„P.   The  ratio 
N/P=M  (the  size  of  the  data  block  in  each  processor)  determines 
the  number  of  elements  that  remain  on  the  main  diagonal,  M/2. 
To  illustrate  the  exact  pattern  of  a  transfer  matrix,  consider 
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a  16-point  DIT  FFT  implemented  using  4  active  processors. 
The  four-processor  system  requires  log~P=2  transfer  stages. 
The  transfer  matrices  describing  the  inter-processor  transfer 
are  shown  in  Figure  2.6.   The  NxN  matrices  are  partitioned 
into  blocks  of  dimension  MxM .   The  first  MxM  block  corresponds 
to  processor  P  n  ,  the  next  block  to  P 1  ,  etc.   Labels  are 
added  indicating  processors  corresponding  to  given  rows  and 
columns.   The  local  processor  memory  addresses  are  listed 
next  to  the  processor  addresses.   The  MxM  blocks  corresponding 
to  individual  processors  are  highlighted.   Numerical  entries 
not  falling  into  a  highlighted  processor  block  (along  the 
diagonal)  describe  the  inter-processor  transfers.   The 
destination  processor  is  indicated  by  the  element's  row 
position  and  the  source  processor  is  given  by  the  column 
position.   "Upper-half"  or  "lower-half"  processors  are 
determined  by  which  column(s)  contain  the  entries  that  have 
moved  outside  the  processor's  block.   If  the  right-most 
columns  (larger  local  memory  addresses)  corresponding  to  a 
source  processor  contain  the  transferring  entries,  the 
processor  is  an  "upper-half"  processor  at  that  stage. 
Similarly,  left-most  column  transfer  entires  correspond  to 
"lower-half"  processors.   Recall  the  determination  of 
"upper-half"  and  "lower-half"  processors  discussed  in  the 
previous  section.   Processors  with  the  same  (i-l)s_t  bit  of 
their  binary  addresses  exhibit  the  same  pattern  at  the  i th 
transfer  stage.   In  addition,  processors  that  exchange  data 
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at  the  it_h  transfer  stage  differ  only  in  the  (i-l)s_t  bit  of 
their  binary  addresses.  To  illustrate  these  ideas,  consider 
the  transfer  matrix  T,  of  Figure  2.6.   The  entries  in  rows  4 
and  5  occur  in  columns  2  and  3,  respectively,  outside  the 
4x4  block  corresponding  to  P~  .   Thus,  during  the  first 
transfer  stage,  processor  Pn  transfers  the  upper-half  (memory 
locations  2  and  3)  of  its  data  to  P,  (processor  corresponding 
to  rows  where  the  entries  appear).  Recall  that  upper  and 
lower  refer  to  the  local  memory  addresses,  and  not  to  relative 
position  in  memory.   Mote  also  that  the  Ot_h  bit  is  the  only 
bit  where  P^'s  and  Pi's  binary  addresses  differ.   (Bit  value=0 
corresponds  to  upper-half  processors,  and  a  1  indicates 
lower-half  processors.) 

The  distance  moved  off  the  main  diagonal  for  those 
elements  shifting  is  given  by  the  difference  between  single- 
processor  memory  separation  of  points  combining  in  butterflies 
and  the  separation  available  in  the  local  processors.  Points 
participating  in  a  butterfly  in  the  single-processor  DIT  FFT 
are  separated  by  1  ,  2  ,  4  ,  8  ,  .  .  .  , N /2  memory  locations  at  successive 
stages.   In  general,  the  separation  between  points  par t icipatini 


in  a  butterfly  at  stage  L  is  2 


L-l 


The  maximum  separation 


available  in  the  MPFFT  is  M/2.   Therefore,  elements  move 
(2      )  -  (M/2)  places  off  the  main  diagonal  when  representing 
the  transfer  prior  to  the  Lt_h  butterfly  stage.   In  the  16-point/ 
4-processor  example,  the  first  transfer  occurs  prior  to  the 
third  butterfly  stage.   Thus,  elements  shifting  off  the  main 
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( 3-1 ) 
diagonal   move  (2       -  (4/2))=  2  places  right  or  left/down 

or  up.   In  summary,  all  the  information  about  an 

inter-processor  transfer  stage  can  be  determined  by 
inspecting  the  corresponding  transfer  matrix. 

2  .   Algorithm  for  Generating  Transfer  Matrices 

The  regular  patterns  exhibited  by  the  transfer 
matrices  make  them  easy  to  generate.   The  row/column  positions 
can  be  expressed  as  an  ordered-pair  (i,j).   For  an  identity 
matrix  (no  transfer),  all  elements  reside  on  the  main 
diagonal  and  i=j .   When  actual  inter-processor  transfers  are 
represented,  the  column  index  is  permuted  so  that  i^j .   The 
algorithm  for  obtaining  the  column  indices  as  permutations 
of  the  row  indices  is  as  follows.   First,  the  rows  and 
columns  are  numbered  from  0  to  N-l  .   Next,  the  row  indices 
are  represented  as  binary  numbers.   The  column  indices  of 
elements  descri bing inter-processor  transfers  are  obtained 
by  switching  2  appropriate  bits  in  the  binary  representation 
of  the  row  indices.   When  computing, an  N-point  MPFFT,  the 
matrices  are  of  dimension  NxN.   Thus,  there  are  log„N  bits 
necessary  to  represent  the  row/column  indices  in  binary. 
Label  these  bits  b,  , b.  .,...,  b.  ,  bfi,  where  L=( log2N ) -1 .   The 
bit  whose  label  is  (L*-l)  is  switched  with  the  bit  whose 
label  depends  on  the  transfer  stage  number.   ( L'::"  =  log^M=number 
of  pre-transfer  butterfly  stages.)   At  the  first  transfer 
stage,  for  example,  bit  number  (L"x"-1)  is  switched  with  the 
bit  indexed  one  number  greater.   At  the  second  transfer 
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stage,  bit  (L*-l)  is  switched  with  the  bit  whose  label  is  two 
greater,  etc.   This  procedure  generates  the  column  position, 
j,  in  the  ordered-pair  (i,j)  from  the  row  position.   For 
example,  an  8-point  DIT  MPFFT  implemented  on  4  processors 
requires  3  bits  to  represent  the  row/column  indices 
0,1,... ,7.   These  bits  are  labelled  b?,b-,,bn.   Then 
L*=log^M=log9 ( N/P)=l .   To  determine  the  column  positions 
of  the  transfer  matrix  entries,  the  row  indices  are  permuted 
as  described  above.   That  is,  for  the  first  transfer  stage, 
bits  bn(L*-l)  and  b,  are  switched.   At  the  second  transfer 
stage,  bits  bn  and  b„  are  switched.   Thus,  the  transfer 
matrices  for  this  example  are  obtained  a  shown  in  Figure  2.7. 
Note  that  switching  the  appropriate  two  bits  at  a  given 
stage  has  no  effect  on  the  elements  that  remain  on  the  main 
diagonal.   For  elements  not  transferring,  the  two  appropriate 
bits  are  the  same  for  the  row  index  at  the  stage  of  interest. 
Consider  now  the  powers  of  2  that  the  bits  represent.   The 
distance  shifted  off  the  main  diagonal  is  the  difference 
represented  by  the  bit-switching  operation.   For  example, 
when  switching  b„  and  b. ,  the  difference  is  2  -2  =1.   At 
this  transfer  stage,  elements  shifting  move  one  place  off 
the  main  diagonal.   This  is  simply  the  difference  between  i 
and  j  when  i^j .   If  this  algorithm  results  in  a  larger 
column  index  ((i,j)  j  i),  the  element  in  the  transfer 
matrix  shifts  right  off  the  main  diagonal.   If  the  row 
index  permutation  results  in  i>j ,  the  corresponding  element 
shifts  left.   Right  shifts  off  the  main  diagonal  correspond 
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to  "upper-half"  processors,  and  left  shifts  correspond  to 
"lower-half"  processors.   The  bit-switching  algorithm  holds 
for  any  number  of  points  and  any  number  of  processors.   It 
always  involves  switching  the  bit  labelled  (L";:"-L)  with  a 
bit  determined  by  the  transfer  stage  number.   Although  this 
algorithm  works  for  either  DIT  or  DIF  MPFFTs  with  bit-reversed 
input  data  sequences ,  it  must  be  slightly  modified  to 
accommodate  pseudo-normal  input.   This  modification  will  be 
developed  in  a  later  section. 

3  .   Orthogonality  of  Transfer  Matrices 

All  transfer  matrices  are  symmetric.   At  each 
transfer  stage,  each  processor  retains  half  its  data.   The 
corresponding  elements  of  the  transfer  matrix  remain  on  the 
main  diagonal  .   Each  processor  that  transfers  the  upper-half 
of  its  data  (transfer  matrix  elements  shifting  right)  receives 
data  from  another  processor's   lower  half  (elements  shifting 
left).   At  each  stage,  elements  shifting  off  the  main  diagonal 
all  move  the  same  number  of  places.   This  results  in  a  symmetric 
permutation  of  the  NxN  identity  matrix. 

Recall  that  entries  in  the  transfer  matrices  only 
take  on  numerical  values  of  1(=W  )  or  0(.=no  entry).   This 
property,  coupled  with  the  symmetry  property,  results  in 
all  transfer  matrices  being  orthogonal.   That  is, 


(w~V  =  (wtR)_1 


R=l,2,...,log2P 


(2.32) 
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Equivalently , 


wTr-(wTr)t  =  I 


(2.33) 


Or,  in  the  shorthand  notation, 


TR  *  T  Rl    =  diag  [0] 


(2.34) 


For  example,  consider  again  T\  of  the  8-point . 4-processor 

DIT  MPFFT.   Substituting  into  (2.33)  gives): 

T       T 

W  l  •  (W  L)       = 


10000000 

0  0    10    0    0    0    0 

0  10    0    0    0    0    0 

0  0    0    10    0    0    0 

0  0    0    0    10    0    0 

0  0    0    0    0    0    10 

0  0    0    0    0    10    0 

0  0    0    0    0    0    0    1 


10  0  0  0  0  0  0 

0  0  10  0  0  0  0 

0  10  0  0  0  0  0 

0  0  0  10  0  0  0 

0  0  0  0  10  0  0 

0  0  0  0  0  0  10 

0  0  0  0  0  10  0 

0  0  0  0  0  0  0  1 


10  0  0  0  0  0  0 

0  10  0  0  0  0  0 

0  0  10  0  0  0  0 

0  0  0  10  0  0  0 

0  0  0  0  10  0  0 

0  0  0  0  0  10  0 

0  0  0  0  0  0  10 

0  0  0  0  0  0  0  1 


Using    shorthand    notation,     (2.34)    gives: 


=    I 


(2.35) 


0 


0 


0 


0 


0 


0 


(2.36) 
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These  convenient  properties  will  be  put  to  use  later.  For 
now,  it  is  sufficient  to  note  that  all  transfer  matrices, 
are  orthogonal  . 

4 .   Pseudo-Normal  Ordering 

It  was  previously  shown  that  single-processor  FFT 
algorithms  can  be  viewed  as  row  or  column  permutations  of  a 
DFT  matrix.  This  section  will  demonstrate  that  multi- 
processor FFT  implementations  can  be  viewed  as  further 
row/column  permutations  of  the  same  basic  DFT  matrix.   The 
additional  row/column  permutations  arise  from  the  inter- 
processor  data  transfers  required  in  the  MPFFT  algorithm. 
The  inter-processor  transfers  also  cause  a  reordering  of  the 
data  sequences  involved  in  the  MPFFT. 

Inter-processor  data  transfers  arise  from  the  need 
to  move  partially-processed  data  points  that  will  be 
combined  in  succeeding  butterfly  stages  in  the  same  processor 
When  computing  a  DIT  MPFFT,  this  permutes  the  rows  of 
succeeding  computation  matrix  factors.   It  will  be  shown 
later  that  the  inter-processor  transfers  similarly  reorder 
the  columns  of  the  decimation-in-f requency  (DIF)  computation 
matrices.   The  exponent  weighting  factors  within  a  given  row 
or  column  remain  unchanged;  only  the  row  or  column  ordering 
is  changed.   As  would  be  expected,  the  row/column 
permutations  also  exhibit  a  regular  pattern,  based  on  the 
pattern  of  the  transfer  matrices. 
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The  rows  and  columns  of  the  matrix  factors  are 
labelled  with  values  of  k  and  n,  respectively.   The  column 
labels,  n,  correspond  to  the  ordering  of  the  input  data 
sequence,  x(n).   For  now,  only  bit-reversed  (BR)  input 
sequences  will  be  considered.   The  row  labels,  k,  represent 
the  ordering  of  the  output  sequence,  X(k).   Starting  with 
the  normal-ordered  output  that  is  produced  by  the  SPFFT 
algorithm  using  BR  input,  this  section  describes  how  the 
normal  ordering  is  transformed  into  "pseudo-normal"  (PN) 
ordering  by  the  inter-procesosr  transfers  on  the  Butterfly 
Network. 

Consider  a  16-point  DIT  MPFFT  implemented  on  an 
8-processor  Butterfly  Network  (BN).   Three  transfer  stages 
(log?8)  are  necessary  in  this  computation.   The  row  labels, 
k,  undergo  the  following  stage-by-stage  permutations: 
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Figure  2.8  Normal  to  Pseudo-normal  Ordering 
(16  points/8  processors) 


Figure  2.8   illustrates  which  data  points  are  combined  in  the 
butterfly  computations  at  each  stage.   During  the  first 
(and  only)  pre-transfer  butterfly  stage,  points  x(0)  and 

x(l)  are  combined  in  a  butterfly  in  processor  P~.   The  second 

(  2-1 ) 
butterfly  stage  requires  separation  between  points  of  2      =2 

memory  locations.  This  separation  is  not  available  within  a 

given  processor,  so  an  inter-processor  transfer  must  occur. 

After  the  first  transfer,  partially-processed  data  points 

x(0)  and  x(2)  reside  together  in  P~  ,  and  another  butterfly 

computation  is  performed.   This  procedure  continues  until 

the  computation  is  completed.   The  PN  output  sequence  X(k) 

that  results  is  in  the  order  shown  after  the  final  transfer 

stage,  T~.   This  pseudo-normal  ordering  depends  only  on  the 

record  length,  N,  number  of  active  processors,  P,  and  their 

ratio,  M=N/P.   The  record  length  determines  the  total 

number  of  butterfly  stages,  log2N.   The  number  of  transfer 

stages  (log?P)  is  set  by  the  number  of  processors.   The  size 

of  the  data  block  in  each  processor,  M,  sets  the  number  of 

pre-transfer  butterfly  stages  (log2M=L#)  that  can  be  computed 

prior  to  any  transfers.   The  PN  ordering  is  a  function  of  the 
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number  of  transfers,  hence  processors.   This  is  analogous  to 
the  SPFFT  case,  where  bit-reversal  is  dependent  on  the  record 

length,  N.   Note  also  from  Figure  2.8  that  one-half  the  points 
in  each  processor  are  output  in  normal  order  ,  after  the  final 
transfer.   Thus,  the  output  sequence  can  be  rearranged  by 
additional  network  transfers,  or  the  processor  local  memories 
can  be  read  cyclically  to  obtain  the  output  data  in  normal 
order  [Ref.  7:  p.  18]. 

The  reordering  of  the  output  sequence,  from  normal 
to  PN  ordering,  can  be  derived  by  a  bit  switching  algorithm 
similar  to  that  for  generating  transfer  matrix  patterns. 
The  bits  to  be  switched  at  each  stage  are  different  in  this 
case,  however.   The  binary  representation  of  the  row  labels, 
k ,  again  requires  log0N  bits  to  describe  the  N-point  output 
sequence.   The  bits  switched  this  time  always  involve  two 
adjacent  bits,  beginning  again  with  the  bit  labelled  (L";c"-1) 
at  the  first  transfer  stage.   The  labels  of  the  bits  to  be 
switched  are  then  each  incremented  by  one  at  each  successive 
transfer  stage.   Figure  2.9  illustrates  the  algorithm  for  2 
different  values  of  P  and  M.   Note  in  Figure  2.9  that  only  M/2 
values  of  k  in  each  processor  are  changed  at  any  given  stage. 
Also  observe  that  the  starting  bit  to  be  switched  at  the 
first  transfer  stage  is  different  for  different  values  of  M. 
This  is  because  fewer  active  processors  compute  more  pre- 
transfer  butterfly  stages,  and  require  fewer  transfers  (less 
reordering).   Consider  again  the  powers  of  2  represented  by 
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Figure  2.9a   Normal  to  Pseudo-Normal  Ordering 

16-points/8-processors  DIT  MPFFT  (L*-1)=0 
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the  bits.   Since  the  first  transfer  in  Figure  2.9a  switches 
bits  b„  and  b,  ,  the  k  values  change  by  2  -2  =1.   This 
explains  why  the  bits  to  be  switched  are  adjacent  in  this 
case.   During  each  transfer,  the  network  is  preparing  for 
butterflies  requiring  separation  between  points  twice  that 
available.   Thus*  bits  representing  successively  higher 
powers  of  2  are  switched  at  succeeding  stages. 
5 .   Selection  of  Transfer  Matrices 

The  transfer  patterns  used  in  references  7  and  9 
are  not  unique.   Other  patterns  of  transfers  were  investigated, 
but  had  the  disadvantage  of  structures  for  which  the  general 
pattern  could  not  be  easily  discerned  and  described  analytically 
[Ref.  9:  p.  74].   As  would  be  expected,  the  same  situation 
exists  regarding  the  transfer  matrix  factors  describing  dis- 
tributed FFTs  on  the  BN .   This  section  discusses  the 
selection  of  appropriate  transfer  matrices  to  represent  the 
inter-processor  transfers  on  the  BN . 

Several  constraints  on  the  make-up  of  potential 
transfer  matrices  can  be  identified.   These  constraints  are: 

1)  Entries  take  on  values  of  1  or  0  only. 

2)  Only  one  entry  per  row  or  column  is  permitted. 

3)  The  toplogy  of  the  Butterfly  Network  must  be  satisfied. 

4)  Data  points  to  be  combined  in  the  next  butterfly  stage 
must  reside  in  the  same  processor  after  transer. 

5)  One-half  the  entries  remain  on  the  main  diagonal. 


54 


These  constraints  greatly  reduce  the  number  of  NxN  matrices 
that  qualify  as  transfer  matrices.   The  constraints  also 
lead  to  the  transfer  matrices  being  symmetric,  hence 
orthogonal.   This  convenient  property  will  be  exploited  in 
the  next  section. 

Beginning  with  the  NxN  identity  matrix,  the  elements 
are  permuted  according  to  the  constraints  listed  above  to 
determine  "legal"  transfer  patterns.   For  the  8-point/ 
4-processor  DIT  MPFFT ,  this  procedure  yields  only  two 
matrices  for  each  transfer  stage  that  meet  all  the  constraints. 
These  matrices  are  shown  in  Figure  2.10.   The  transfer  matrices 
actually  used  are  shown  in  Figure  2.10a.   The  other  matrices 
that  satisfy  all  the  constraints  are  shown  in  Figure  2.10b. 
When  the  transfer  patterns  of  Figure  2.10b  are  used,  two 
undesirable  properties  result.   First,  since  the  higher- 
numbered  data  points  occupy  the  lower  addresses  in  each 
processor  memory,  the  butterflies  that  follow  must  be 
inverted,  with  the  twiddle  factor  paterns  upside-down. 
Additionally,  following  the  second  transfer,  the  weighting 
factor  patern  is  scrambled  from  the  order  expected.   The 
second  undesirable  result  involves  the  ordering  of  the 
output  sequence.   Instead  of  normal,  bit-reversed,  or 
pseudo-normal  ordering,  all  of  which  can  be  unravelled,  the 
output  is  in  indecipherable  order:  X(5)  X(l)  X(4)  X(0)  X(7) 
X(3)  X(6)  X(2).   In  addition  to  meeting  all  the  constraints, 
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the  transfer  patterns  used  in  reference  7  also  lead  to 
a  consistent  pattern  in  the  twiddle  factors  in  every 
butterfly  stage.   The  PN  ordering  of  the  output  sequence 
is  easy  to  describe  and  unravel.   Note  also  that  the 
alternate  transfer  matrices  are  more  difficult  to  generate  . 
That  is,  the  bit-switching  algorithm  presented  in  Section 
II. D.  2  does  not  work  for  the  alternate  transfer  matrices. 

E.   GENERALIZED  MULTI-PROCESSOR  FFTs 

Single-processor  FFT  (SPFFT)  algorithms  permute  the 
row  or  column  ordering  of  the  DFT  matrix  which  they  imple- 
ment.  This  is  turn  reorders  the  output  (permuted  rows)  or 
input  (permuted  columns)  data  sequence.   The  reordering 
caused  by  SFFT  algorithms  changes  normal-ordered  input  or 
output  sequences  into  bit-reversed  (BR)  order.   The  multi- 
processor implementation  of  the  decimation-in-time  (DIT) 
FFT  algorithm  described  in  Section  II. D. 4  has  been  shown 
to  further  reorder  the  rows  of  the  basic  DFT  matrix.   This 
further  permutes  the  ordering  of  the  transformed  output 
sequence.   As  was  shown,  when  using  bit-reversed  input  data, 
the DIT  multi-processor  FFT  produces  a  pseudo-normal  (PN) 
ordered  output  sequence.   Pseudo-normal  ordering  is  thus  a 
consequence  of  this  multi-processor  FFT  (MPFFT)  algorithm, 
as  bit-reversed  ordering  is  associated  with  SPFFT 
algorithms.  This  section  describes  the  generalized 
factorization  of  MPFFT  matrices  to  obtain  the  stage-by-stage 
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matrix  factors.   The  DIT  MPFFT  (BR  input/PN  output)  examined 
previously  will  be  extended  to  represent  the  DIT  MPFFT 
with  PN  input  ordering,  and  to  decimation-in-f requency  (DIF) 
MPFFT  algorithms  using  both  BR-  and  PN-ordered  input 
sequences  . 

The  overall  MPFFT  matrix  of  twiddle  factor  exponents, 
E  ,   is  obtained  by  forming  the  modulo  N  product  of  the 
appropriately  ordered  row  and  column  labels  [Ref.  2:  p.  43]. 
The  row  and  column  labels  represent  the  ordering  of  the 
output  and  input  sequences,  respectively.   A  BR  input 
sequence  to  the  MPFFT  implemented  using  the  Butterfly 
Network  produces  PN-ordered  output.   Similarly,  a  PN  input 
data  sequence  produces  transformed  output  in  BR  order.   This 
is  true  whether  a  DIT  or  DIF  MPFFT  is  being  computed, 
although  the  matrix  factors  representing  individual  computa- 
tion stages  differ.   Thus,  after  calculating  the  BR  and  PN 
sequence  ordering  for  the  given  number  of  data  points  and 
active  processors,  it  is  easy  to  write  down  the  overall 
MPFFT  matrix. 

Once  the  overall  MPFFT  matrix  is  obtained,  it  must 
be  factored  to  represent  the  individual  computation  stages. 
As  discussed  previously,  an  N-point  FFT  requires  a  total 
of  log^N  butterfly  stages.   When  implemented  on  the  multi- 
processor Butterfly  Network,  there  are  L*=log9M  pre-transfer 
butterfly  stages.   (Recall  that  N=record  length,  P=number 
of  active  processors,  and  M=N/P=number  of  points  in  each 
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processor.)   Therefore,  each  computation  stage  is  represented 
by  one  matrix  factor,  with  one  transfer  matrix  prior  to  each 
post-transfer  butterfly  computation  matrix  factor.   Each 
matrix  factor  successively  pre-multiplies  the  factor(s) 
representing  the  previous  stage(s)  when  the  matrix  product 

is  carried  out.   Thus,  the  general  form  of  the  MPFFT  is: 

T    E         E 
x(k)=wElogNwTlogPwE(logN)-l  wT(logP)-l...W  l    W  l0gM...W^ l    x(n) 


post- transfer  butterflies  and 
transfers 


pre-transrer 
butterflies 


(2.37) 
where  all  logarithms  in  the  matrix  indices  are  base  two. 
The  remainder  of  this  section  discusses  how  to  form  each 
of  the  matrix  factors  in  (2.37). 

1 .   PIT  MPFFT  with  Bit-Reversed  Input 

The  MPFFT  matrix,  E  ,   is  formed  by  reordering  the 
row  and  column  indices  as  described  previously,  and  then 
forming  the  modulo  N  products  to  give  the  individual 
elements.   This  matrix  must  then  be  factored  to  obtain 
the  representation  of  each  computation  and  transfer  stage. 
The  most  illustrative  case  to  consider  is  when  L"""  =  log^M=l  . 
In  this  case,  there  is  only  one  pre-transfer  butterfly 
stage,  and  the  maximum  number  of  transfers  and  post-transfer 
butterfly  stages.   Each  processor  contains  M=2  data  points, 
and  is  thus  capable  of  performing  a  2-point  DFT  internally. 
At  successive  stages,  each  processor  can  compute  one-half 
of  a  4-point  DFT,  one-fourth  of  an  8-point  DFT,  etc., 
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provided  that  appropriate  data  points  are  transferred  to 

reside  within  each  individual  processor.   Other  situations 

involving  more  than  2  points  per  processor  are  easily 

described  in  relation  to  the  L*=l  case. 

Each  NxN  matrix  factor  in  (2.37)  is  first  partitioned 

into  blocks  of  dimension  MxM.   Each  MxM  block  along  the  main 

diagonal  represents  an  individual  processor.   The  computation 

stage  matrices  ET  contain  entries  only  in  the  submatrix 

blocks  along  the  diagonal.   The  entries  in  these  blocks 

describe  the  butterfly  computations  taking  place  within  the 

individual  processors.   The  transfer  matrices  T   contain 

~  R 

entries  outside  the  blocks  along  the  diagonal  to  describe 
the  inter-processor  data  transfers. 

Consider  once  again  the  8-point/4-processor  DIT 
MPFFT  using  BR  input  of  Figure  2.4.   Note  that  there  are  a 
total  of  3(log98)  computation  matrix  factors.   Ji,  represents 
the  first  ( pre-transf er )  butterfly  stage,  and  there  are 
2  (log?4)  matrix  factors  representing  the  post-transfer 
butterfly  stages,  each  preceded  by  a  transfer  matrix. 
Recall  that  the  DIT  algorithm  decomposes  an  8-point  DFT  into 
two  4-point  DFTs ,  then  into  four  2-point  DFTs .   See  Figure  2.11 
This  decomposition  partitions  the  input  time  sequence  into 
subsequences  having  even  and  odd  indices.   At  each 
successive  decomposition,  the  butterfly  inputs  are  separated 
by  larger  increments  in  time.   The  first  butterfly  stage 
combines  samples  from  the  time  sequence  separated  by  4 
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units  (N/2).   Thus,  the  first  matrix  factor ,  E,  ,  has  column 
(time)  indices  0  and  4.   The  ordering  of  the  row  indices,  k, 
at  the  first  stage  is  the  normal  ordering  characteristic 
of  the  single-processor  DIT  FFT  with  bit-reversed  input  data 
The  row  ordering  is  then  permuted  by  each  inter-processor 
transfer  stage  until  the  output  sequence  ends  up  in  the  PN 
ordering  characteristic  of  the  MPFFT  on  the  Butterfly 
Network.   The  second  butterfly  stage  in  the  DIT  FFT  combines 
points  separated  by  two  memory  locations.  In  this  example, 
this  separation  is  not  available  within  a  given  processor, 
so  an  inter-processor  transfer  must  take  place.   This 
reorders  the  rows  of  the  next  computaton  stage,  and  all 
successive  stages.   The  algorithm  for  reordering  the  row 
indices  was  presented  in  Section  II. D. 4,   The  time 
separation  between  points  combining  in  the  second  butterfly 
stage  is  2  units,  so  the  column  labels  for  this  matrix 
factor  (E0)  are  0  and  2.   Note  that  each  twiddle  factor 
exponent  in  the  blocks  along  the  diagonal  is  still  obtained 
by  the  modulo  N  product  of  the  row  and  column  indices.   The 
third  (final)  computation  stage  marix  factor,  E~  ,  is  derived 
similarly.   The  row  ordering  is  further  permuted  by  the 
second  inter-processor  transfer.   The  row  indices  of  the 
final  computation  stage  are  now  in  the  PN  order  in  which 
the  transformed  output  sequence  appears.   The  column  labels 
are  0  and  1  ,  corresponding  to  the  butterfly  inputs  ' 
separation  of  1  unit.   It  can  be  verified  by  comparing 
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Figure  2.4  with  Figure  2.2  that  identical  butterfly  computa- 
tions take  place  at  each  stage.   That  is,  each  row  in  the 
computation  matrix  factors  contains  identical  entries  at 
any  stage,  only  the  ordering  of  the  rows  is  changed  in  the 
MPFFT  by  the  inter-processor  transfers. 

An  explanation  of  the  coiumn  labels  of  the  transfer 
matrices  is  also  necessary.   A  column  label  of  0  indicates 
that  the  element  in  that  column  remains  on  the  main  diagonal 
(doesn't  represent  a  transfer  at  that  stage).   The  negatively- 
valued  labels  indicate  elements  moving  down  off  the  main 
diagonal,  and  correspond  to  lower-half  processors  at  the 
transfer  stage  in  question.   Similarly,  positively-valued 
labels  indicate  elements  moving  up  from  the  main  diagonal 
(upper-half  processors).   The  magnitude  of  the  transfer 
matrices  '  column  labels  in  the  DIT  algorithm  corresponds  to 
the  time  separation  between  points  that  are  to  be  combined 
in  the  butterflies  at  the  next  computation  stage.   The  row 
indices  are  permuted  by  previous  transfers  the  same  as  the 
two  indices  of  the  computation  matrix  factors. 

The  matrix  factors  in  (2.37)  are  called  sparse 
matrices  because  of  the  numerous  zero  entries  in  any  row 
or  column.   The  sparse  matrices  cut  down  the  number  of 
arithmetic  operations  required  to  compute  the  DFT  [Ref  .  2: 
p.  67].   Multiplying  out  the  matrix  factors  gives  the 
permuted  DFT  matrix  E  as  the  result.   Taking  the  products 
can  be  regarded  as  a  mixing  operation  in  which  the  column 


63 


time  indices  of  the  matrix  factors  add.   Successive 
decomposition  of  the  input  time  sequence  results  in  E, 
having  only  time  values  of  0  and  4  units,  E«  having  only 
time  values  of  0  and  2,  and  E~  having  only  time  values  of 
0  and  1.   When  the  matrix  factors  are  multiplied,  the  time 
values  mix  so  that  E„";:"T^';:"E„  has  time  values  of  0,4,2,  and  6 
units.  Similarly,  in  taking  (  E  „  * T  ~  * E  ~  )  * T 1  *  E 1  ,  the  mixing 
results  in  the  product  having  time  values  of  0,4,2,6,1,5,3,7 
This  represents  the  ordering  of  the  input  sequence  which  was 
successively  decomposed  at  each  stage  in  the  factorization 
of  E   to  give  the  DIT  algorithm.   Thus,  carrying  out  the 
DIT  MPFFT  (multiplying  the  matrix  factors)  can  be  viewed 
as  a  recombination  of  the  input  sequence  that  was  decomposed 
in  setting  up  the  algorithm.   The  row  indices  of  the  E 
matrix  are  determined  by  the  row  ordering  at  the  final 
computation  stage,  and   represent  the  ordering  of  the 
transformed  output  sequence,  X(k). 

The  8-point /4-processor  example  (L";c~=l)  can  be 
extended  in  several  ways  as  follows.   Since  M=N/P,  there  are 
really  only  two  variables  in  (2.37),  the  record  length,  N, 
and  the  number  of  active  processors,  P.   The  total  number 
of  butterfly  stages  must  equal  the  number  of  pre-transfer 
and  post-transfer  butterfly  stages.   Therefore: 


log2N  =  log2M+log2P  =  L*  +  log2P 


(2.38) 
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Two   situations  can  result  in  fewer  transfer  stages  than  the 
L*=l  example  just  cited.   Extending  the  input  record 
length  to  a  higher  power  of  2  while  keeping  the  number  of 
processors  constant  results  in  one  or  more  additional  pre- 
transfer  butterfly  stages.  This  situation  will  also  increase 
the  total  number  of  butterfly  stages,  log?N.   Also,  decreasing 
the  number  of  active  processors  (again,  required  to  be  a 
power  of  2)  for  a  constant  input  sequence  length  results 
in  more  pre-transfer  butterfly  stages.   This  situation  can 
result  when  processors  suffer  faults,  and  there  are  no  spares, 
and  leaves  the  total  number  of  butterfly  stages  unchanged. 
Both  of  these  situations  increase  the  number  of  pre-transfer 
butterfly  stages  by  increasing  the  size  of  the  data  block 
within  each  processor,  M.   This  makes  each  processor  capable 
of  performing  more  than  one  pre-transfer  butterfly  stage. 
To  illustrate,  consider  again  the  8-point  DIT  MPFFT,  this 
time  using  only  2  processors,  as  shown  in  Figure  2.12. 
There  is  still  a  total  of  3  (log«8)  butterfly  stages.   Now, 
however,  there  are  2  (log„(8/2))  pre-transfer  butterfly 
stages  ,  and  only  1  transfer  stage  and  1  post-transfer 
butterfly  stage.   Referring  to  the  algorithm  of  Section 
II.  D.  4  to  calculate  the  PN  ordering  for  L"x"  =  2  ,  the  E  matrix 
can  be  written  down.   Each  of  the  matrix  factors  ET  is 
divided  into  submatrix  blocks  of  dimension  MxM.   Now, 
however,  the  blcoks  will  be  4x4.   Each  processor  contains 
enough  data  to  compute  2  pre-transfer  butterfly  stages. 
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Figure  2.12   8-Point /2-Processor  DIT  FFT  (BR  Input) 


66 


Since  during  the  pre-transfer  stage(s)  each  processor  is 
functioning  independently  (i.e.,  as  a  single-processor), 
the  pre-transfer  matrices  are  identical  to  the  SPFFT  case. 
Note  also  that  all  processors  contain  the  same  weighting 
factor  pattern  prior  to  any  transers,  regardless  at  which 
stage  the  first  transfer  occurs. 

When  the  submatrix  blocks  representing  the 
individual  processors  are  of  dimension  4x4  or  greater, 
there  are  dots,  representing  no  entry,  within  each  block. 
This  is  analagous  to  the  single-processor  case,  in  that 
some  or  all  of  the  required  memory  separation  between 
points  is  available  within  the  processors'  local  memories. 
When  the  stage  is  reached  where  inter-processor  transfers 
are  necessary,  rows  of  the  succeeding  stage  matrices  are 
permuted,  as  always.   The  column  labels  representing 
time  values  of  each  matrix  factor  are  such  that  each 
processor  block  contains  an  equal  number  of  zero  and  non-zero 
time  labels.   Regarding  the  transfer  matrices,  one  or  several 
of  the  transfer  stages  required  in  the  L*=l  case  are 
"bypassed"  as  the  N/P=M  ratio  increases.   The  algorithm 
for  generating  the  transfer  matrices  presented  in  Section 
II. D. 2  still  applies,  but  the  number  represented  by  L"x"-1 
(first  bit  to  be  switched)  increases. 
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2.   PIT  MPFFT  with  PN  Input 

The  DIT  FFT  algorithm  can  be  used  on  a  single 
processor  with  normal-ordered  input,  to  produce  bit- 
reversed  output.   Similarly,  the  DIT  multi-processor 
algorithm  can  also  be  used  with  a  pseudo-normal  input 
sequence.   As  would  be  expected,  the  transformed  output  is 
produced  in  bit-reversed  order.   The  MPFFT  matrix  E   is  the 
transpose  of  the  E   matrix  of  the  BR  input  case.   This 
reflects  the  change  in  th  input-output  relationship  of 
the  PN-  and  BR-ordered  sequences.   The  elements  of  the  E 
matrix  are  thus  still  obtained  from  the  modulo  N  product 
of  the  row  and  column  indices  representing  the  output  and 
input  sequence  ordering.   Unfortunately,  the  matrix  factors 
describing  individual  stages  are  not  simply  the  transposes  of 
the  matrix  factors  from  the  BR  input  algorithm.   This  is 
due  to  the  different  ordering  of  the  input  sequence.   The 
inter-processor  transfers  occur  in  reverse  order,  and  for 
the  case  where  L";:"^l  (fewer  than  maximum  number  of  transfers), 
even  occur  at  different  stages.   Specifically,  for  an 
8-point /2-processor  DIT  algorithm,  L*=log ~M=2 .   When  using 
PN  input  data,  there  is  one  transfer  stage  (log„P)  as 
expected.   However,  the  one  transfer  is  required  after  the 
first  butterfly  stage,  and  not  after  two  pre-transfer 
stages,  as  in  the  BR  input  case.   There  are  two  post-transfer 
butterfly  stages,  but  only  one  transfer.   That  is,  there  is 
no  second  transfer  required  between  the  two  post-transfer 
butterfly  stages.   (Compare  Figure  2.13  with  Figure  2.12). 
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Figure  2.13   8-Point /2-Processor  DIT  FFT  (PN  Input) 
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Again,  the  easiest  case  to  consider  is  when  L"x*=l  . 
There  is  only  one  pre-transfer  butterfly  stage,  and  each 
post-transfer  stage  is  preceded  by  an  inter-processor 
transfer  stage.   In  this  case,  though,  since  the  Butterfly 
Network  is  transforming  PN  input  into  BR-ordered  output  , 
the  transfer  pattern  is  the  reverse  of  that  in  the  BR 
input  case.   The  individual  transfer  matrices  are  the  same, 
since  they  are  symmetric,  but  they  occur  in  reverse  order. 
(Compare  T,  of  Figure  2.1.4  and  T2  of  Figure  2.4.)   The  column 
labels  once  again  correspond  to  the  time  decomposition  of 
the  input  sequence.   For  an  8-point  DIT,  the  labels  are 
0  and  4  at  the  first  stage,  0  and  2  at  the  second  stage 
and  0  and  1  at  the  final  stage.   The  column  indices  of  the 
transfer  matrices  are  also  the  same  as  before.   The  non-zero 
magnitudes  still  indicate  time  separation  at  the  next 
butterfly  stage.  Values  of  zero  indicate  non-transferring 
elements,  and  the  sign  gives  direction  of  movement  off  the 
main  diagonal.   The  row  indices  for  the  first  computation 
matrix  factor  are  in  what  will  be  called  "pseudo-bit- 
reversed"  (PBR)  order.   This  is  the  output  ordering  that 
would  result  in  a  SPFFT  using  pseudo-normal-ordered  input. 
Recall  that  the  exact  PN  ordering  is  a  function  of  the 
number  of  required  network  transfers,  log„P.   The  same 
is  true  of  PBR  ordering,  since  the  transfers  ultimately 
permute  the  PBR  ordering  into  bit-reversed-ordered 
output.   Thus,  the  elements  within  each  matrix  factor  can 

70 


TV 


E. 


T1'-E1 


1^.0    -2,0   -2,2   0      2   0_         ky_0   4  (    0   4  f  0   4  (  0   4  ky_0   2,02^.4,24 

0   0  ' 


0 

0 

! 

i 
i  • 

1 

•  1 

1 

"   J_ 

i° 

1 

4 

1   0 

.  i 
i 

1  •    • 

5 

•  i 

|o  ■ 

2 

°T 

i  • 

1 
1 

3 

— j — 

i  . 
-—i — 

0  1 

6 

0   1 

1  •    • 

7 

.  1 

1 

|    .    0 

0    4 


0   0  I 

I 

0    4  | 

1 T 


,   °    ° 

I   0    4 

•  I 


0   0 

0    4 


0   0 


1 


0    4 


0    0 


0    0 

I 

i T 


L— -r2J.j 


0    4  I 


0    0 


0    4 


Ii'-«i 


B2*Ii,*B1 


RyO   2      0   2      0    2      0   2  £?_0   2      0   2      2   4      2   4  kyO   4      0   4      2    5      2   6 


0    0   I 

0    4   ! 


I 
I 

0    0    I 

0    4    i 


I 

I 

,J_. 


10    5     | 

i r 


i  o  2 

0    6 


0    0 


0    4 


I  0   0 

I 

.+ — I 


I 

,1. 


0    4 


0    0 


0    4 


0    0 


0    4 


0   0   j 

0    0   I 

1- 


0   0 


r 


- 


0    4   I 
0    4    i 


0   0 

a 1 


.-I 


[  0    4 
I   0   4 


k\_0   -110 


^2 


E2  -T1'.E1 


0   0 

4    4 


0   0 
4   4 


2    6   I 

I 
5   2  i 


2   6 
6   2 


I    0 


:j 


T' 


_^__0 


I 
.1. 


0    -3,    1    0  ky?   4      0   4      2   6      2   6  lc\_0    3 


T 


0     . 


0   0 
0   0 


,  °  ° 

I  4    4 

t- t 

o  o  I 


0   4 
0    4 


'"I 


0   0 


I  2   6 
|_6_2 


0    4 


?3 


0    4  I 

i  i 

T0*  *  E 


2*±1 


lk_0    1      0    1.01.01  lc\_0   3  ,    14,25 


0   0  i 
0   4  ' 

I  0    2 


0    6 


0    1   I 


0   3 


I  0   7 


0   0 


0   0 


I  I 

I    0_0  i 

0   o|  ^44 

1    0   0  I 
1---  + 1 

I  2    6 
I 


0    4 
0   4 


J_°_1l 


I 

I    0    4  I 


6    2 


0   0 
4   4 


2    6 
6   2 

'  -eT 


T2'*E2,T1'*E1 
14      2   5      3   6 


0   0 


I 

0   0 


0    4 


0    4 


0   0 


J L 

4    4  I 


0   0 


0    4 


0    4 


I 
0   0   I 


0   0 


2    6 


I  4   4 

'1 


2   6 


6    2 


i  6   2_1 


3   6  i^_0   4,15      26,37 


0   0 


4   4 


2_6 
6   2 


0   0      0   0 

0   0  14   4 

1 

0   0,    2   2 
0   Ol    6   6 


0   4115 
j_  0  _4_{_  _5_  1 

0   4J    3   7 
0   417   3 


0   0 

_0_0 

4    4 

4    4 


2   6 
2    6 


6   2 
6   2 


0   0 

4_4 

6  6 
2_2 

3   7 

-    u  I    7   3 


1   5 
5   1 


(2.41) 

Figure  2.14   8-Point /4-Processor  DIT  FFT  (PN  Input) 
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still  be  calculated  from  the  modulo  N  product  of  the  row 
and  column  indices.   The  row  ordering  is  permuted  by  each 
successive  transer  ,  as  before.   At  the  final  computation 
stage,  the  row  indices,  k,  are  in  bit-reversed  order 
corresponding  to  the  ordering  of  the  transformed  output 
sequence.   The  column  labels  of  the  E   matrix  can  again 
be  found  by  summing  the  time  indices  of  each  of  the 
individual  matrix  factors. 

The  situation  becomes  slightly  more  complicated 
when  L * ^ 1 .   For  PN  input,  there  is  only  one  pre-transfer 
butterfly  stage  before  inter-processor  transfers  become 
necessary.   There  will  be  the  same  number  of  transfer 
stages  as  before  (BR  input  case),  but  they  occur  in 
reverse  order.   This  leaves  log9M=L"5:"  computation  factors 
at  the  end  without  any  transfer  matrices  between  them. 
This  is  analagous  to  the  BR  input  case,  when  there  were 
L  *  computation  stages  prior  to  any  transfers. 

Recall  the  discussion  on  P«N  ordering  from  Section 
II. D. 4.   Each  processor  contains  two  subsequences  of  M/2 
points  in  normal  order,  each  subsequence  separated  by 
N/2.   Since  the  DIT  algorithm  combines  points  separated 
by  N/2  at  the  first  butterfly  stage,  the  PN  input  case 
calculates  M/2  2-point  DFTs  in  each  processor  at  the  first 
computation  stage.   There  are  (H/2)-l  dots  (no  entry)  in  each 
block  submatrix  corresponding  to  the  local  memory 
separation  between  points  combining  in  the  butterflies. 
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Refer  again  to  Figure  2.13.   The  butterfly  width  remains 
constant  for  successive  stages  preceded  by  inter-processor 
transfers,  and  decreases  when  computation  stages  are  no 
longer  preceded  by  transfer.   Again,  recall  the  BR  input 
case  where  butterfly  widths  grew  to  a  maximum,  determined 
by  local  memory  contents,  then  remained  constant  as 
transfer  stages  occurred.   Since  the  PN  input  sequence 
does  not  have  points  separated  by  N/4  (second  stage  DIT 
butterfly  width)  residing  in  any  processor,  the  first 
transfer  stage  occurs  after  the  first  butterfly  stage. 
Transfers  occur  until  log^P  stages  have  been  carried  out. 
The  final  L*  butterfly  stages  do  not  require  transfers 
between  them,  as  mentioned  above.   Each  transfer 
successively  permutes  the  pseudo-bit-reversed  row  labels, 
until  the  bit-reversed  output  sequence  emerges. 
3 .   Pseudo-Bit-Reversed  Ordering 

Further  discussion  of  the  nature  of  pseudo-bit- 
reversed  (PBR)  ordering  is  necessary.   Just  as  normal 
ordering  is  permuted  to  PN  ordering  by  the  DIT  MPFFT  using 
BR  input,  PBR  ordering  serves  as  the  starting  point  for 
PN  input  algorithms.   Also,  PBR  order  is  the  ordering  of 
the  output  that  would  result  from  a  SPFFT  DIT  algorithm 
using  PN  input  ordering.   The  terms  "PN"  and  "PBR  order" 
are  really  meaningless  when  talking  about  single-processor 
implementations,  since  each  is  a  function  of  the  number  of 
inter-processor  transfers.   However,  if  a  particular 
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configuration  is  assumed,  say  8  points/4  processors,  then 
that  particular  PN  ordering,  when  input  to  a  SPFFT,  would 
result  in  the  corresponding  (8/4)  PBR-ordered  output.   In 
any  case,  the  algorithm  of  Section  II. D. 4,  which  generates 
PN  from  normal  ordering  ,  can  also  be  used  to  generate  PBR 
ordering  from  bit-reversed  order.   To  use  the  algorithm 
as  presented,  the  bits  must  first  be  reordered  to  bit- 
reversed  ordering.   That  is,  instead  of  descending  order, 
the  bits  are  reordered  in  ascending  order.   For  example, 
a  16-point  data  sequence  requires  4  bits  to  describe  the 
set  in  binary.   These  bits  when  bit-reversed  are  ordered 
bn ,  b,  ,  b~,  b~.   Now  the  algorithm  may  be  applied 
directly  to  generate  pseudo-bit-reversed  order,  for  a 
given  number  of  processors.   Recall  that  the  algorithm 
switches  bits  L*-l  and  L*  at  the  first  transfer  stage, 
then  increments  both  bits  at  successive  stages.   Thus,  at 
the  second  stage,  bits  L*  and  L*+l  are  switched,  and  so  on. 
The  transfer  matrices  used  for  the  PN  input  case  occur 
in  reverse  order  from  the  BR  input  case.   Thus,  this 
algorithm  can  be  used  in  reverse,  to  give  the  stage-by-stage 
row  indices,  k,  and  ultimately,  the  BR-ordering  of  the 
output  sequence.   Refer  to  Table  1  for  a  display  of  these 
relationships  . 
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TABLE  1:   SEQUENCE  REORDERING  (DIT  MPFFT) 
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4.   DIF  MPFFT  Algorithms 

As  in  single-processor  algorithms,  the  multi- 
processor DIF  FFT  exhibits  much  duality  in  relation  to  the 
multi-processor  DIT  FFT.   The  MPFFT  matrix  E   can  again 
be  viewed  as  a  DFT  matrix  with  rows  and  columns  reordered 
corresponding  to  the  output  and  input  sequences.   The  E 
matrix  of  the  DIF  with  PN  input/BR  output  is  the  transpose 
of  the  E   matrix  of  the  DIT  with  BR  input/PN  output.   The 
transpose  of  a  product  of  matrices  is  the  product  of  the 
transposed  matrices,  in  reverse  order.   The  stage  matrix 
factors  of  the  DIF  with  PN  input  are  thus  the  transpose 
of  the  factors,  in  reverse  order,  of  the  DIT  with  BR 
input.   The  DIF  algorithm  partitions  the  output  sequence 
into  even  and  odd  values  of  k,  the  frequency  index.   The 
row  indices  now  represent  frequency  separation  at  the 
individual  computation  stages.   That  is,  the  first  matrix 
factor,  E, ,  has  frequencies  0  and  4.   The  units  of 
frequency  depend  on  the  analysis  period,  or  record  length. 
The  column  indices  start  out  in  the  order  of  the  input 
sequence  for  the  first  stage  factor.   They  are  then 
reordered  by  each  transfer  stage,  in  the  same  manner  that 
the  row  indices  are  permuted  in  the  DIT  algorithm.   Thus, 
the  column  indices,  n,  represent  the  points  of  the  input 
sequence  that  reside  in  a  given  processor  at  any  given  stage 
The  transfer  matrices  are  the  same  as  for  the  DIT,  and  occur 
in  the  same  order  as  the  DIT  with  the  same  input  ordering. 
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That  is,  the  DIF  with  PN  input  has  the  same  transfer 
pattern  as  the  DIT  with  PN  input.   The  pattern  is  the 
reverse  of  the  DIT  with  BR  input.   Since  the  transfer 
matrices  are  symmetric,  the  transpose  operation  leaves 
them  unchanged.   Thus,  the  individual  factors  of  the  DIF 
with  PN  input/BR  output  are  each  the  transpose,  in  reverse 
order,  of  the  factors  of  the  DIT  with  BR  input/PN  output. 
Since  the  ordering  is  reversed,  the  magnitudes  of  the 
transfer  matrices'  row  labels  are  given  by  the  frequency 
separation  at  the  previous  stage.   Recall  that  the 
magnitude  of  the  column  labels  on  the  transfer  matrices  for 
the  DIT  are  given  by  the  time  separation  for  the  next 
stage.  (See  Figure  2.15  and  compare  with  Figure  2.4.)  Similar 
duality  exists  between  the  DIF  with  BR  input/PN  output  and 
the  DIT  using  PN  input/BR  output.  The  E   matrices  are  the 
transpose  of  one  another,  and  the  stage  factors  of  the  DIF 
are  the  transpose  of  the  factors  of  the  DIT,  taken  in 
reverse  order.  (See  Figure  2.16  and  compare  with  Figure  2.14.) 

This  section  has  presented  the  factorization  of 
FFT  matrices,  to  obtain  their  multi-processor  representations. 
Following  sections  will  investigate  the  implementation  of 
other  fast  transforms  on  the  Butterfly  Network. 
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Figure    2.15      8-Point /4-processor 
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Figure  2.16   8-Point /4-Processor  DIF  FFT  (BR  Input) 
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III.   OTHER  FAST  TRANSFORMS  ON  THE  BUTTERFLY  NETWORK 

This  section  uses  the  tools  developed  previously  to 
investigate  other  fast  transform  algorithms  for  potential 
implementation  on  the  Butterfly  Network  (BN).   Section  II 
used  matrix  representations  to  show  that  the  radix-2  FFT 
algorithm  is  supported  by  the  BN .   That  is,  the  radix-2 
FFT  can  be  partitioned  so  that  multiple  processors  can 
carry  out  the  computation  in  parallel.   The  BN  supports 
the  required  inter-processor  transfers  of  partially- 
processed  data,  producing  transformed  output  in  a  new,  yet 
easily  decipherable,  order.   This  section  examines  several 
other  algorithms  to  determine  their  adaptability  to 
parallel  processing  on  the  BN .   The  analysis  will  focus 
on  the  required  inter-processor  transfer  patterns,  and 
the  output  ordering  produced  from  the  Butterfly  Network. 

A.   FAST  HARTLEY  TRANSFORM  (FHT) 

Bracewell  has  articulated  a  real,  fast  transform 
algorithm  that  serves  all  the  purposes  of  the  complex 
Fourier  transform  [Ref.  11].   It  is  called  the  Fast 
Hartley  Transform  (FHT)  in  honor  of  Hartley,  whose  1942 
formulation  of  a  real  integral  transform  is  the  basis  for 
the  FHT.   The  FHT  algorithm  is  faster  than  the  FFT,  and 
if  desired,  may  even  be  used,  with  a  little  additional  cost, 
to  compute  the  DFT. 
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As  shown  previously  with  the  DFT ,  the  discrete  Hartley 
transform  may  be  viewed  as  a  square  matrix  operating  on  an 
N-dimensional  vector.   The  DHT  matrix  can  then  be  factored 
to  obtain  the  FHT  algorithm,  where  each  factor  represents 
one  stage  of  the  algorithm.   The  step   from  the  Hartlev 
transform  to  the  Fourier  transform  can  also  be  expressed 
as  a  matrix  multiplication  and  so  the  factorization  of 
the  Hartley  matrix  leads  directly  to  a  new  factorization 
of  the  Fourier  matrix  [Ref.  11:  p.  64]. 

1 .   Single-Processor  FHT  CSPHFT)  Algorithm 

The  Discrete  Harley  Transform  (DHT)  is  defined 

[Ref.  11:  p.  27]  as  the  sum  of  the  cosine  and  sine  transforms 

N-l 
X(k)    =  ±         2  x(n)    cas(27rnk/N)  ,       k=0 , 1 , . . . , N-l 

n  =  0 

(3.1) 

where  cas  6     =  cos  6   +    slnd    .   The  inverse  transform  is: 

N-l 
x(n)  =   X   X(k)  cas  (2^nk/N),    n=0 , 1 , . . . , N-l     (3.2) 
k  =  0 

Note  that  only  real  arithmetic  is  used,  and  except  for  the 
factor  N    in  (3.1),  the  forward  and  inverse  transforms 
are  identical.   Thus,  for  real  input  data,  the  DHT  is  also 
real  valued,  and  the  original  sequence  can  be  obtained  by 
using  the  same  transformation  formula  a  second  time. 
Equation  (3.1)  may  be  represented  as  a  vector-matrix 
equation : 
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X(k)  =  H  x(n) 


(3.3) 


The  DHT  matrix  H  can  be  factored  to  give  the  FHT ,  similar 
to  the  way  FFTs  are  obtained  from  the  factorization  of  a 
DFT  matrix.   The  input  sequence  is  again  assumed  to  have 
length  equal  to  a  power  of  2.   Thus,  the  DHT  matrix  H  is 
factored  as  : 


5  *  tG  £g-i 


feii 


(3.4) 


where  G  =  log«N.   The  factors  L  ,  s=l , 2 , . . . , log^N  are 
called  stage  matrices,  and  represent  the  computations  at 
each  stage  of  the  algorithm.   The  input  sequence  x(n)  is 
in  bit-reversed  order,  and  yields  normal-ordered  output. 
Each  of  the  factors  in  (3.4)  are  sparse  matrices, 
contributing  to  the  speed  of  the  FHT  algorithm.   The  FHT 
employs  a  "divide  and  conquer"  strategy  similar  to  the 
FFT .   An  N-point  DHT  is  obtained  as  a  linear  combination 
of  the  outputs  of  two  N/2-point  DHTs ,  each  in  turn 
obtained  as  combinations  of  the  outputs  of  two  N/4-point 
DHTs,  etc.   The  decomposition  proceeds  until  2-point 
sequences  result.   The  DHT  of  a  2-point  sequence  is 
obtained  simply  as  the  sum  and  difference  of  the  2  data 
elements.   To  illustrate,  applying  the  definition  (3.1) 
for  N=2  yields: 
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X(0)  =  i[x(0)(cos  O+sin  0)+x(l)(cos  O+sin  0)] 


X(l)  =  i[x(0)(cos  O+sin  0 )+x( 1 ) ( cos ^     +sin  *  )]   (3.5) 


X(0)  =  i[x(0)+x(l) ] 


X(l)  =  i[x(0)-x(l)] 


(3.6) 


In  matrix  form 

1(0) 

X(l) 


ri 


-i 


t\ 


^(oy 


i_ 


x(l) 


(3.7) 


Reference  11  defines  a  general  decomposition  formula  that 
produces  the  DHT  by  successive  halving  of  the  input 
sequence.   Given  the  input  sequence  x  (  n  )  ,  n=0,l,...,N-l, 
let  the  N-element  subsequences  [x(0)  0  x(2)  0  . . . £  and 
{x(l)  0  x(3)  0  .  .  .}   have  DHTs  X^k)  and  X9(k)  respectively 
Then, 

X(k)  =  Xx(k)  +  X2(k)cos(27rk/N)+X2(N-k)sin(27rk/N) 

(3.8) 
This  structure  corresponds  to  the  factorization  of  the 
DHT  matrix,  H  [Ref.  11:  p.  85]. 

The  second  stages  compute  one  or  more  A-point 
DHTs,  using  two  2-point  DHTs  as  inputs.   The  cosine  and 
sine  multipliers  will  be  either  1 ,  0,  or  -1 ,  since 
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(2-n-nk/N)  =  (27rnk/4)  =  (nk7r/2),  and  only  integer  multiples 
of  t/2    are  thus  possible.   Applying  (3.1)  for  N=4  and 

putting  the  results  in  matrix  form  gives: 


X(0) 
X(l) 
X(2) 

X(3) 


1 

1 

1 

1 

1 

1 

-1 

-1 

1 

-1 

1 

-1 

1 

-1 

-1 

1 

L 


x(0) 
x(l) 
x(2) 
x(3) 


(3.9) 


The  columns  are  rearranged  to  correspond  to  bit-reversed 
input,  yielding: 


X(0) 

1 

1 

1 

1 

x(0) 

X(l) 

1 

-1 

1 

-1 

x(2) 

X(2) 

1 

1 

-1 

-1 

x(l) 

1(3) 

1 

-1 

-1 

1 

x(3) 

(3.10) 


Equation  (3.10)  can  be  factored,  since  half  the  terms  can 
be  computed  as  2-point  DHTs ,  as  mentioned  above.   This 
factorization  yields: 


X(0) 
X(l) 
X(2) 
X(3) 


1         1 

-1 
1        -1 


=  L0  L,  x( n) 


1 

1 

x(0) 

1 

-1 

x(2) 

1 

1 

x(l) 

1 

-1 

x(3) 

(3.11) 
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It  is  easily  verified  by  carrying  out  the  matrix 
multiplication  that  (3.11)  is  equivalent  to  (3.10). 
(In  this  development,  and  in  the  future,  the  N    factor 
is  neglected  for  clarity.   It  can  be  inserted  at  the 
end,  if  necessary  for  normalization.) 

The  third  and  subsequent  stage  factors,  representing 
8-  (or  more)  element  FHT  stage  matrices  are  more  complicated 
The  inputs  to  the  8-point  FHT  are  the  outputs  of  two 
4-point  FHTs .   This  recursive  property  for  generating 
higher-order  FHTs  from  the  combination  of  lower-order 
FHTs  never  changes.   What  does  change  is  the  number  of 
non-zero  elements  per  row  in  the  stage  matrices.   For  the 
third  and  subsequent  stages,  there  are  three  non-zero 
elements  in  each  row.   This  means  that  three  elements 
enter,  with  appropriate  coefficients,  into  each  output 
element.  The  third  stage  factor  is: 


b  " 


1 


'0 


Cl   Sl 
K„ 


S3       C3 


S       S5 


s7  C? 


(3.12) 
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where  C   =  cos(27rn/8),  S   =  sin(2?rn/8)  and  K   = 
n  n  n 

[  cos(  2irn/8)+sin(  27rn/8)  ]  .   Note  that  the  cosine  elements 
and  the  l's  are  on  diagonals  parallel  to  the  main  diagonal, 
but  the  sine  elements  run  the  other  way.   As  a  result, 
the  independent  variable  of  the  operand  element  that 

multiplies  the  sine  coefficient  S   decreases  as  n 

r  n 

increases.   This  property  is  referred  to  as  retrograde 
indexing  [Ref.  11:  p.  73].   The  arguments  of  the  sine  and 
cosine  terms  are  integer  multiples  of  vlk,    taking  on  a 
numerical  value  of  either,  0,  +1,  or  +  1 /V  JT   Putting 
numerical  values  in  the  L^  matrix  and  carrying  out  the 
matrix  multiplication  gives  the  result  shown  in  Figure  3.1. 
Again  it  is  easy  to  verify  that  the  product  in  Figure  3.1 
gives  the  same  result  as  the  DHT  definition  of  (3.1). 
Conversion  of  the  DHT  to  the  DFT  is  accomplished  simply 
by  associating  the  even  and  odd  parts  of  the DHT  with  the 
real  and  imaginary  parts  of  the  DFT.   For  a  discussion  of 
the  mechanics  of  this  operation,  see  reference  11,  pp.  75-77 
and  reference  12,  p.  150.   However,  if  the  power  spectrum 

is  desired,  it  can  be  computed  directly  from  the  DHT  by 

2  2 

evaluating  [X(k)]  +[X(N-1)]  .   Finally,  note  that  since 

the  FHT  is  a  real  transform  compared  to  FFT ,  which 

is  complex,  the  real  and  imaginary  parts  of  any 

complex  input  data  can  be  processed  in  parallel, 

respectively,  with  the  FHT  [Ref.  12:  p.  147]. 
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Figure  3.1   8-Point /1-Processor  DIT  FHT  (BR  Input) 


87 


2.   Multi-Processor  FHT  (MPFHT)  Algorithm 

The  Fast  Hartley  Transform  can  be  partitioned  among 
several  processors  for  parallel  implementation  on  the 
Butterfly  Network.   The  number  of  points  per  processor, 
M=N/P,  must  be  greater  than  or  equal  to  4,  since  the 
third  and  successive  computation  stages  ( N _>8 )  combine 
3  elements  at  a  time.   Thus,  there  are  at  least  two 
(log?M)  pre-transfer  stages  for  any  N-point  FHT.   The 
recursive  nature  of  the  FHT  yields  inter-processor  transfer 
requirements  similar  to  the  MPFFT.   That  is,  after 
L"::"=log?M  pre-transfer  stages,  there  are  transfer  stages 
interspersed  between  the  subsequent  post- transfer  computation 
stages.   As  always,  the  transfer  stages  permute  the  row 
ordering,  and  thus  the  ordering  of  the  output  sequence. 
Processors  again  retain  half  their  data,  and  transfer  half 
at  any  given  stage .   When  more  than  one  transfer  stage  is 
required  (4  or  more  active  processors),  the  patterns  of 
the  output  ordering  are  apparently  not  so  convenient  as  in 
the  MPFFT.   More  desirable  patterns  than  those  to  be 
presented  may  well  exist.   The  difficulty  lies  with  the 
fact  that  the  amount  of  data  transferred  is  always  a  power 
of  2  (one-half  each  processor's  memory),  while  3  points 
are  combined  in  the  third  and  successive  computation  stages. 

The  easiest  case  to  consider  is  an  8-point  FHT 
implemented  using  2  processors.   There  are  3  (log^8) 
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computation  stages:   2  pre-transfer  computation  stages 
(L*  =  log2(N/P)=2) ,  1  transfer  stage  (log2P=l)  and 
1  post-transfer  computation  stage.   For  this  case: 


X(k)  =  L~  T,  L„  L-  x(n) 

-*,       ~j  ~i  ~*£    —i  ~ 


(3.14) 


The  product  is  computed  in  Figure  3.2  and  can  be  compared 
with  Figure  3.1.   The  transfer  matrix  is  symmetric,  and 
was  chosen  in  keeping  with  the  invertability  property  of 
the  DHT .   Note  that  this  transfer  pattern  causes  bit- 
reversed  output,  when  the  input  is  also  in  bit-reversed 
order.   Again,  exactly  one-half  of  each  processor's  memory 
is  exchanged.   This  time,  however  processors  exchange 
alternating  memory  locations,  instead  of  the  upper-  or 

lower-half  as  in  the  MPFFT.   Processor  P   retains  its 

o 

even-labelled  memory  locations,  and  transfers  its  odd- 
labelled  memory  locations  in  exchange  for  processor  P, 's 
even-numbered  data.   The  order  of  computation  after 
transfer  is  somewhat  subjective,  and  was  chosen  to  minimize 
the  amount  of  buffering  required  within  each  processor. 
In  this  case,  the  requirement  of  only  2  additional  memory 
locations  per  processor  for  buffering  appears  to  be 
optimum . 

Extending  the  discussion  to  N=16  illustrates  the 
features  of  the  generalized  multi-processor  FHT.   For  a 
16-point  data  sequence  transformed  using  the  same  2 
processors,  the  vector-matrix  equation  extends  to: 
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Figure  3.2   8-Point /2-Processor  DIT  FHT  (BR  Input) 
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X(k)  -  L4  lx    L3  L2  Lj  x(n) 


(3.16) 


The  recursive  property  of  the  FHT  keeps  the  first  3  stage 

matrices  L,  ,  Ln ,  and  L~  of  (3.13)  the  same,  with  their 

patterns  repeating  along  the  main  diagonal.   Thus,  only 

the  transfer  stage  matrix  T,  and  the  final  16-point 

computation  stage  L,  need  be  discussed.   The  transfer 
r  ~4 

pattern  is  chosen  again  so  that  T-i  is  symmetric.   The 
symmetry  property,  along  with  the  property  of  only  one 
non-zero  element  (whose  value  is  unity)  per  row  and 
column,  makes  the  transfer  matrices  presented  herein 
orthogonal.   Processor  .P~  once  again  transfers  its  odd- 
labelled  memory  locations  (this  time,  4  elements)  to  P. , 
in  exchange  for  P,  Ts  even-numbered  element  .     The  order 
of  computation  obviously  determines  output  ordering. 
This  will  be  discussed  further  in  the  next  section. 
Matrices  T,  and  L,  are  shown,  as  is  their  product,  in 

~1        ~4 

Figure  3.3.   The  fourth  stage  compuation  matrix  L,  of 

the  single-processor  product  L,  •  T 1  is  equivalent  to  the 

-~4     ~  1 

single-processor  L,  matrix  with  reordered  rows.   Since, 
for  any  given  row,  the  column  entries  are  the  same 
both  implementations  perform  the  same  computation, 
although  the  ordering  of  their  outputs  is  different. 

Consider  now  the  16-point  FHT  implemented  on  a 
4-processor  Butterfly  Network.   This  illustrates  the 
"maximum  transfers"  case  for  the  FHT  since  more  than  3 
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Figure  3.3   Fourth  Stage  of  1 6-Point /2-Processor  MPFHT 
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points  per  processor  are  required.   This  is  not  really  a 
problem,  since  the  multi-processor's  advantage  lies  in 
the  computations  that  can  be  done  in  parallel,  whether 
prior  to  or  after  the  required  inter-processor  transfers. 
Reference  11  also  contains  a  program  called  FASTPERMUTE 
that  speeds  up  the  required  reordering  of  the  input 
sequence,  from  normal  to  bit-reversed  order.   This  thesis 
assumes  that  a  bit-reversed  sequence  is  made  available 
as  the  input  to  the  multi-processor  algorithms  presented. 

The  vector-matrix  equation  for  the  16-point/ 
4-processor  MPFHT  is: 

X(k)  =  L,  T0  L;  L,T,  L0  L,  x(n)  (3.17) 

^       ~4  ~2  ~4  ~4  I  ~z  ~1  ~ 

where  part  of  the  fourth  computation  stage  is  performed 
prior  to  the  second  transfer,  for  reasons  to  be  explained 
below.   The  pre-transfer  stage  matrices,  L,  and  L0  are 
identical  to  the  single-processor  versions,  and  duplicate 
the  patterns  of  the  2-  and  4-point  FHTs  as  submatrices 
along  the  main  diagonal.   The  first  transfer  stage  T, 
occurs  at  the  same  place  as,  and  duplicates  the  pattern 
of,  the  8-point/2-processor  MPFHT.   This  should  be 
expected,  since  both  have  the  same  value  of  L*  =  log2M=log2( N /P) , 
and  it  is  this  value  that  drives  inter-processor  transfer 
requirements.  The  first  post-transfer  computation  stage 
is  also  identical  to  the  8-point/2-processor  case,  again 
with  the  element  pattern  repeated  along  the  main  diagonal. 
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These  patterns  were  chosen  in  keeping  with  the  recursive 
property  of  the  single-processor  FHT  algorithm.   That  is, 
an  8-point  FHT  can  be  made  up  from  the  outputs  of  two 
4-point  FHTs ,  each  in  turn  derived  from  the  outputs  of 
two  2-point  FHTs.   This  property  is  responsible  for  the 
lower-order  computation  matrices  repeating  as 
submatrices  along  the  diagonal  in  the  early  stages  of 
higher-order  FHTs. 

The  fact  that  the  FHT  combines  3  elements  into 
one  output  element  at  the  third  and  successive  computation 
stages  causes  problems  when  more  than  2  processors  are 
used.   Recall  that  the  number  of  active  processors,  P, 
sets  the  number  of  inter-processor  transfer  stages  (  1  o  g  9  P ) 
Thus,  for  4  or  more  active  processors,  there  are  2  or  more 
transfer  stages.   Since  the  transfers  occur  after  all  the 
pre-transfer  computations  (for  N=16  or  a  higher  power 
of  2)  the  transfers  occur  between  stages  involving  3 
points  in  the  computations.   Therefore,  an  inherent 
incompatibility  exists  between  the  transfers,  involving 
a  number  of  points  equal  to  a  power  of  2  (one-half  of 
each  processor's  local  memory),  and  calculations  involving 
3  elements.   Figure  3.4  illustrates  that  all  elements  that 
are  other  than  1  (multiples  of  sin(7r/8)  and  cos(?r/8))  fall 
into  the  right-half  of  matrix  L.  (columns  8-15).   The 
non-unity  coefficients  multiply  elements  x  (8)  through 
x  (15),  which  are  outputs  from  stager  .   Note  in  LQ 
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Figure  3.4   Fourth  Stage  of  16-Point/l-Processor  FHT 
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of  Figure  3.5  that  all  these  elements  occur  in  processors 
P?  and  P~.   Thus,  one  way  around  the  problem  is  to 

partially  pre-compute  the  fourth  stage  prior  to  transfer. 

After  the  inter-processor  transfer,  stage  L,  finishes  the 

16-point  MPFHT  computation,  using  only  2  elements  in  the 

calculation  of  each  output  point. 

Note  that  the  partial  pre-calculation  (stage  L' 

in  Figure  3.5  and  Equation  3.17)  leaves  processors  Pn  and 

P,  idle  at  that  stage.   This  inefficiency  pays  dividends 

in  the  final  stage  4  computation  (L,),  since  this  stage 

now  only  involves  sums  and  differences  of  2  elements.   No 

synchronization  problems  should  be  generated,  since  the 

next  stage  is  a  transfer  stage,  involving  handshaking. 

Partial  precomputaton  is  only  necessary  when  2  or  more 

transfers  are  required,  and  arises  in  part  due  to  the 

decisions  made  involving  patterns  of  the  first  transfer 

and  order  of  computation  of  the  first  post- transfer 

computation  stage.   The  products  of  the  third  and  fourth 

stages  of  the  16-point /4-processor  MPFHT  are  shown  in 

Figure  3.6.   It  can  be  verified  by  comparing  Figure  3.6a 

with  Figure  3.4  that  the  multi-processor  implementation 

performs  identical  computations  as  the  single-processor 

algorithm.   The  output  is  reordered  in  the  multi-processor 

version,  as  indicated  by  the  row  reordering  caused  by 

transfer  T0  .   The  columns  of  the  product  L,  Tn    L ,' , 
~2  —4  ~z  ~4 
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16-Point/4-Processor  MPFHT  (Stages 
3  and  4)  (Continued) 
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are  reordered  as  well.   This  reflects  the  fact  that  the 
input  to  the  partial  pre-coraputation  stage  (!■!)  comes  from 
the  output  of  stage  L~  .   This  sequence  was  permuted  by 
the  first  transfer  in  the  4-processor  implementation,  T,  . 
This  is  verified  by  comparing  Figure  3.6b  with  Figure  3.1, 
which  shows  identical  elements  in  any  given  row,  with  the 
rows  of  Figure  3.6b  reordered  due  to  transfer  T.  . 
Figure  3.6b  can  also  be  compared  with  Figure  3.2.   Note 
that  each  factor  in  Figure  3.6b  is  a  duplicate  of  the 
factors  in  Figure  3.2,  as  is  the  final  product.   Thus, 
the  MPFHT  retains  the  recursive  property  of  the  FHT ,  in 
that  the  initial  stages  of  higher-order  FHTs  are  duplicates 
of  the  outputs  of  lower-order  FHTs,  as  desired. 
3 .   Transfer  Patterns  and  Output  Ordering 

The  single-processor  FHT  (SPFHT)  accepts  bit- 
reversed  input,  and  generates  a  normal-ordered  output 
sequence.   The  multi-processor  FHT  (MPFHT)  implemented 
on  the  Butterfly  Network  also  accepts  input  in  bit-reversed 
order.   The  transformed  output  sequence,  however,  is 
permuted  by  the  inter-processor  transfers  required.   The 
permuted  output  is  not  the  pseudo-normal  (PN)  ordering 
characteristic  of  multi-processor  FFTs ,  for  reasons  that 
will  be  explained  shortly.   Normal-ordered  output  may 
again  be  obtained  by  additional  network  transfers,  or  by 
cyclically  reading  the  processors'  local  memories. 
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Recall  the  earlier  statement  that  the  "maximum 
transfers"  case  of  the  MPFFT  won't  work  with  the  MPFHT. 
That  is,  4  or  more  points  must  be  contained  in  each 
processor.   The  situation  involving  the  maximum  number 
of  transfer  stages  in  the  MPFFT  occurs  when  only  2  points 
are  contained  in  each  processor.  Thus,  the  first  transfer 
of  the  "maximum  transfers"  MPFFT  case  (L*=log„M=l)  is 
effectively  bypassed  in  the  MPFHT  "maximum  transfers" 
case  (L"x"  =  2).   With  this  in  mind,  the  algorithm  of 
Section  II. D. 2  for  generating  transfer  matrices  can  be 
modified  to  generate  the  MPFHT  transfer  patterns.   The 
modification  to  the  "bit-switching"  algorithm  of 
Section  II. D. 2  is  necessary  because  of  which  memory 
locations  are  involved  in  the  MPFHT  transfers.   Recall 
that  the  MPFFT  transfers  involved  the  upper-  or  lower-half 
of  a  processor's  local  memory.   The  MPFHT  algorithm 
transfers  the  even-  or  odd-numbered  local  memory 
locatons ,  instead  of  adjacent  upper-  or  lower-half 
memory  locations.   This  all  relates  to  the  properties 
of  counting  in  binary.   That  is,  every  even  decimal 
number  has  a  binary  representation  ending  in  0,  while 
every  odd  decimal  number's  last  bit  is  1  when  expressed 
in  binary.   The  modified  algorithm  for  generating  MPFHT 
transfer  matrices  is  described  below. 
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As  in  the  MPFFT  case,  the  MPFHT  transfer  matrices 
are  viewed  as  permutations  of  NxN  identity  matrices. 
This  procedure  determines  the  column  indices  where  l's 
appear,  by  switching  appropriate  bits  in  the  binary 
representation  of  the  row  indices.   The  bits  being 
switched  differ  from  those  of  the  MPFFT  transfer  matrix 
algorithm.   Again,  this  is  due  to  the  fact  that  the  half 
of  each  processor's  memory  transferring  is  now  based  on 
even-  and  odd-numbered  local  memory  locations.   Not 
surprisingly  then,  one  of  the  bits  to  be  switched  is 
always  the  last,  or  Oth  bit,  since  this  bit  determines 
even  or  odd.   (Bits  are  again  labelled  by  bT  -  .  .  .  b~b , b„ , 
where  L=( log~N ) -1 . )   The  first  transfer  stage  swaps  bit 
b„  with  bit  b  "x",  where  L*;$'  =  log^M=log7  (N/P)  .   Successive 
transfer  stage  switch  b~  with  bT#+.  ,  then  with  b,  ^  « , 
etc.   For  example,  in  the  1 6-point /4-processor 
implementation  of  Figure  3.5,  log~N=4  and  L*=2 .   Thus,  the 
binary  row  indices  have  bits  labelled  b^b^b-jbp..   The 
first  stage  transfer  matrix  T   determines  column 
position  of  elements  by  switching  b~    with  b„  in  the  row 
indices.   The  second  transfer  matrix  T0  switches  bn    and  b„. 
Source  and  destination  processors  are  identical  to  the 
MPFFT  case,  and  can  be  calculated  using  the  cube,._,>. 
functions  to  find  a  destination,  given  a  source. 
Additionally,  at  the  ith  transfer  stage,  the  (i-l)st 
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bit  of  the  processor's  binary  address  determines  whether 
a  processor  transfers  its  even  or  odd  memory  locations. 
If  the  (i-l)st  bit  is  zero,  the  processor  transfers  its 
odd-numbered  memory  locations,  and  if  the  bit  is  one, 
the  processor  transfers  even-numbered  memory  locations. 
As  in  the  MPFFT  with  "upper-half"  and  "lower-half" 
processors,  those  processor's  transferring  their  odd- 
numbered  memory  locations  receive  the  even-numbered 
locations  of  another  processor.   This  convenient 
property  again  makes  all  transfer  matrices  symmetric 
and  orthogonal.   See  Figure  3.7  for  a  graphical  display 
of  this  example. 

Discussion  of  the  reordering  of  the  output 
sequence  can  be  simplified  by  distinguishing  between  bit 
postions  and  bit  labels.   Let  the  normal-ordered  output 
sequence  have  binary  representation  with  bit  labels 
b„b,?b,bn,  occupying  bit  locations  (positions)  b^b^b^bj-. 
(N=16).   Now  the  algorithm  j ust.  discussed  generates 
column  indices  from  row  indices,  when  applied  to  bit 
locations.  That  is,  each  transfer  stage  starts  anew 
assuming  bit  locations  b„b„b,br)  for  row  indices,  then 
switches  the  two  appropriate  bits  for  that  transfer  stage 
to  give  column  positions  where  l's  occur.   Since  the 
transfer  stage  permutes  the  output  ordering  cumulatively, 
the  bit  labels  are  not  restarted  at  successive  transfer 
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stages.   For  example,  the  16-point /4-processor  case  begins 

with  bit  labels  b^b^b^b^  in  bit  locations  b^b-b^p..   The 

first  transfer  T,  switches  b   and  b0  to  give  bit  labels 

~1  o       2     & 

(output  sequence)  and  bit  locations  (column  indices) 
bob^b^bo.   The  second  transfer  T0  again  starts  with  row 
ordering  in  bit  locations  b„b„b-,b    while  the  bit  labels 
are  left  partially-reordered  (  bobrjb,  b«)  •   Then,  switching 
b„  and  b„  gives  column  indices  ordered  bnb7b,bo>  while 
the  output  sequence  is  b~b~b -i  b.,  .   When  the  bits  of  the 
respective  binary  representations  are  permuted  according 
to  this  rule,  the  correct  column  positions  and  output 
sequence  order  result.   This  procedure  was  used  when 
labelling  the  figures  in  this  section.   The  column 
indices  correspond  to  input  ordering  at  a  given  stage,  and 
row  indices  give  output  ordering  from  that  stage. 
Therefore,  the  row  labels  of  the  final  computation  stage 
give   output  ordering   in  all  cases. 

To  summarize,  the  Fast  Hartley  Transform  is 
realizable  on  the  multi-processor  Buttrerfly  Network. 
Several  pertinent  features  of  the  implementation  have  been 
presented,  but  details  of  the  features  and  properties  of 
this  new  algorithm  have  of  necessity  been  only  partially 
covered.   The  interested  reader  is  referred  to  reference  11 
for  a  more  detailed  discussion  of  this  algorithm  that  has 
great  potential  for  signal  processing  applicatons. 
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B.   FAST  WALSH-HADAMARD  TRANSFORM  (WHT) 

While  the  DFT  uses  sinusoids  of  harmonic  frequencies 
as  its  basis  functions,  other  transforms  can  be  defined 
using  other  bases.   The  Walsh  functions,  named  for 
J.L.  Walsh  who  introduced  them  in  1923,  are  the  basis 
functions  for  the  Walsh-Hadamard  transform  (WHT) 
[Ref.  2:  p.  301].   Walsh  functions  can  be  generated 
recursively,  are  orthogonal,  and  form  a  closed  set. 
Since  they  are  square  waves,  Walsh  functions  take  on 
only  two  values,  namely  +1  or  -1  [Ref.  13:  p.  225]. 
This  property  enables  the  WHT  to  be  calculated  using  real 
number  operations,  involving  only  additions  and 
subtractions.   This  section  presents  two  of  the  several 
possible  ordering  schemes  for  calculating  the  WHT. 
After  introduction  of  the  single-processor  implementation, 
the  WHT  will  be  examined  for  compatibility  with  the 
multi-processor  Butterfly  Network.   Again,  the  emphasis 
will  be  on  inter-processor  transfer  requirements  and 
output  ordering. 

1  .   Single-Processor  Walsh  Ordered  Transform  (WHT)w 
Walsh  functions  can  be  rearranged  to  form 
different  ordering  schemes  such  as  Walsh  or  sequency 
order,  Hadamard  or  natural  order  (shown  in  Figure  3.8), 
Paley  or  dyadic  order,  and  cal-sal  order  [Ref.  2:  p.  303], 
This  section  examines  the  transform  derived  from  the 
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Walsh  functions  are  described  in  terms  of  sequency, 
which  combines  the  number  of  sign  changes  and  frequency. 
Sequency  is  defined  as  the  number  of  sign  changes  per  unit 
interval : 


sequency  = 


i(z..c.) 

Hz.c.  +  i) 


even  Z.C 
odd  Z.C. 


(3.18) 


where  Z.C.  is  the  average  number  of  zero  crossings  per 
unit  interval  [Ref.  2:  p.  305]. 

Sampling  of  the  Walsh  functions  at  equally  spaced 
sample  points  generates  the  Walsh-Hadamard  matrices  in 
corresponding  ordering.   Periodic  sampling  of  the 
Walsh  functions  in  sequency  order  at  8  equi-spaced 
sample  points  gives: 
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3   3 
where  [Hw(3)]  is  thus  the  2  x2   Walsh-Hadamard  marix  in 

Walsh  or  sequency  order.   The  rows  represent  Walsh 

functions  whose  sequencies  are  listed.   Whereas  W~ 

r 
defines  the  DFT  matrix  for  N=2  ,,  [Hw(G)]  defines  the 

(WHT)w  matrix  [Ref.  2:  p.  310].   The  (WHT)w  is  thus 

defined  as : 


X(k)  =  (l/N)[Hw(G)]x(n) 


(3.20) 


while  the  input  data  sequence  can  be  recovered  from  the 
inverse  transform: 


x(n)  =  [Hw(G)]X(k) 


(3.21) 


The  same  algorithm  can  be  used,  then,  to  generate  the 

transform  or  its  inverse,  except  for  the  (1/N)  factor. 

One  fast  algorithm  that  results  from  the  marix  factorization 

of  (3.19)  is  shown  in  Figure  3.9.   This  factorization 

2 
reduces  computation  requirments  from  N   to  Nlog^N  additions 

and  subtractions,  where  N  is  again  the  length  of  the  input 
sequence.   The  algorithm  represented  in  Figure  3.9  accepts 
bit-reversed  input,  and  generates  a  transformed  output 
sequence  in  normal  order.   Note  also  that  this  algorithm 
can  be  computed  in-place,  thus  minimizing  extra  memory- 
required  for  buffering.   The  f actored-mat rix  equation 
corresponding  to  Figure  3.9  is  shown  in  Figure  3.10. 
Again,  each  matrix  factor  in  (3.22)  represents  one  butterfly- 
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computation  stage.   The  final  product,  W»   is  equivalent 
to  [Hw(3)]  in  (3.19),  with  the  columns  reordered  to 
correspond  to  the  bit-reversed  input  .   The  rows  of  W 
are  in  the  normal  order  in  which  the  output  appears. 

2 .   Multi-Processor  Walsh  Ordered  Transform  (MPWHT)w 
The  Walsh  or  sequency  ordered  transform  (WHT)w 
can  be  partitioned  among  several  processors  for 
implementation  on  the  Butterfly  Network.   An  alternate 
(equivalent)  signal  flowgraph  to  enable  partitioning  is 
shown  in  Figure  3.1.1.   It  is  easily  verified  that  the 
flowgraph  of  Figure  3.11  represents  identical  computations 
to  those  of  Figure  3.9.   Note  that  multi-processor 
implementation  of  Figure  3.9  directly  is  not  possible 
due  to  the  ordering  of  points  participating  in  the  first 
butterfly  stage.   The  flowgraph  of  Figure  3.11  allows 
points  participating  in  first  stage  butterfly  computations 
to  be  input  to  the  same  processors  when  implemented  using 
more  than  one  processor.   The  input  data  is  normal- 
ordered,  however,  and  the  algorithm  produces  bit-reversed 
output.   Thus  the  WHT  matrix  W   is  the  transpose  of  the 
W   matrix  of  the  BR  input /normal-ordered  output  case. 
The  stages  perform  the  same  computations,  but  in 
different  order.   Thus  the  matrix  factorization  represented 
by  the  signal  flowgraph  of  Figure  3.11  represents  an 
alternate  factorization  of  the  WHT  matrix.   This 
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alternate  factorization  is  shown  in  Figure  3.12,  and 
is  the  basis  for  the  multi-processor  implementation  of 
the  WHT  described  in  the  following  paragraphs. 

The  factorization  of  (3.23)  shown  in  Figure  3.12 
allows  the  multi-processor  Walsh  ordered  transform 
(MPWHT)w  to  parallel  the  development  of  the  MPFFT 
presented  in  Section  II.   The  length  of  the  input  sequence 
and  the  number  of  active  processor  are  again  assumed  to 
be  powers   of  2.   There  are  log„N  computation  matrix  factors 
each  representing  a  butterfly  computation  stage. 
(N=input  record  length.)   These  computation  matrices  are 
again  divided  into  L*x"  =  log9M  pre-transfer  stages  and 
log^P  post-transfer  stages,  with  log~P  inter-processor 
transfer  stages.   (P=number  of  active  processors, 
M=N/P=number  of  points  in  each  processor.)   The 
"maximum  transfers"  case  again  occurs  when  L*=l  and 
each  processor  contains  only  2  data  points.   For  the 
8-point  WHT  on  4  processors,  the  vector-matrix  equation 
factors  as  : 

X(k)  =  W„  T0  W„  T,  W-  x(n)  (3.24) 

where  the  W  matrices  represent  butterfly  stages,  the  T 
matrices  describe  transfer  stages,  x(n)  is  the  input 
sequence,  and  X(k)  is  the  transformed  output  sequence. 
The  factorization  of  (3.24)  is  shown  in  Figure  3.13  and 
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Figure  3.12   8-Point/l -Processor  Fast  WHTw  (Normal  Input) 
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Figure    3.13      8-Point /4-Processor    Fast    WHTw    (Normal    Input) 
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represents  the  multi-processor  implementation  of  (3.23) 
from  Figure  3.12.   Note  the  effect  of  the  multi-processor 
implementation  on  the  ordering  of  the  output  sequence. 
The  single-processor  WHTw  produces  bit-reversed  output 
when  the  input  is  normal-ordered  (Figure  3.12).   The 
transfers  in  the  multi-processor  algorithm  reorder  the 
output  to  pseudo-bit-reversed  (PBR)  order.   Note  also 
that  the  transfer  matrices  are  the  same  as  the  8-point/ 
4-processor  MPFFT  with  BR  input  (Figure  2.4).   This 
convenient  property  means  that  the  algorithm  of 
Section  II. D.  4,  which  transforms  normal  to  pseudo-normal 
ordering,  may  be  used  to  permute  BR  (single-processor 
output)  into  PBR  ordering.   A  slight  modification  must  be 
made  to  account  for  BR  order  as  the  starting  point, 
rather  than  normal  ordering.   Whereas  the  bit  positions 
are  labelled  b9b,bn  for  the  8-point  normal-to-PN 
reordering  algorithm,  the  bit  positions  must  be 
relabelled  bnb,b?  to  correspond  to  BR-to-PBR  reordering. 
This  is  the  same  idea  as  shown  in  Table  1 ,  and  results 
in  the  appropraite  reordering  when  the  "bit-switching" 
algorithm  is  applied. 

Just  as  with  the  MPFFT,  the  MPWHTw  can  also 
be  implemented  using  2  processors.  In  this  case  the 
equation  is: 

X(k)  =  W„  T,  W„  W-  x(n)  (3.26) 

"s*         ~*>o    <^»JL  *^Z  >vl  <**> 
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and,  as  in  the  MPFFT,  there  are  2  pre-transfer  butterfly 
stages,  a  transfer  stage,  and  1  post-transfer  butterfly 
stage.   Transfer  matrix  T-,  is  again  identical  to  T,  of 
the  8-point /2-processor  DIT  MPFFT.   The  output  is  again 
in  PBR  order,  though  different  from  the  8-point /4-processor 
PBR  order.   Since  there  is  only  one  transfer  in  the 
2-processor  case,  its  output  is  less-reordered  from  the 
single-processor  BR  ordering.   This  is  analogous  to  the 
PN  ordering  of  the  MPFFT,  which  was  shown  to  also  be  a 
function  of  the  number  of  transfers  required. 

The  MPWHTw  may  also  be  implemented  using  PBR  input, 
to  produce  normal-ordered  output.   Using  the  signal 
flowgraph  of  Figure  3.11,  now  with  PBR-ordered  input, 
the  8-point /4-processor  MPWHTw  matrix  factors  as: 

X(k)  =  W~  T  '  W9  T,  !  W,  x(n)  (3.27) 

where  T ,'=T0  of  the  normal  input  algorithm  and  T0'=T, 
of  (3.24)  and  Figure  3.13.   TheWHTw  matrix  W   is  the 
transpose  of  the  product  W   from  Figure  3.13.   The 
individual  stage  factors  are  different,  however.   They 
represent  the  same  computations  at  each  stage,  but  in  a 
different  order,  since  the  input  ordering  is  changed. 
These  ideas  are  illustrated  in  Figure  3.14.   The 
situation  is  slightly  more  complicated  when  only  2 
processors,  requiring  only  1  transfer  stage,  are  used 
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Figure  3.14   8-Point /4-Processor  Fast  WHTw  (PBR  Input) 
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with  a  PBR-ordered  input  sequence.  The  vector-matrix 
equation  in  this  case  is: 


X(k)  =  W0  W0  T.  W.  x(n) 
~w        xj  ~~Z    ~1  ~I  ~ 


(3.29) 


where  the  output  is  again  produced  in  normal  order.   The 
transfer  matrix  T,  is  identical  to  T-.  of  the  8-point/ 
2-processor  MPWHTw  with  normal-ordered  input  ,  but  occurs 
after  only  one  pre-transfer  butterfly  stage.   The 
product  of  (3.29)  is  again  the  transpose  of  the  W   matrix 
of  the  normal  input  case,  and  the  same  calculations  are 
performed  at  each  stage,  again  in  different  order.   Thus, 
for  PBR  input,  there  are  log~P  butterfly  stages  with 
log^P  transfers  between  each,  then  log9M  post-transfer 
butterfly  stages.   The  Walsh  ordered  transform  is  easily 
implementable  on  the  multi-processor  Butterfly  Network, 
and  exhibits  many  of  the  same  properties  as  the  multi- 
processor FFT.   It  can  be  used  with  normal-ordered  input, 
to  yield  pseudo-bit-reversed  output,  or  vice  versa. 

3 .   Single-Processor  Hadamard  Ordered  Transform  (WHT)h 
Walsh  functions  can  be  rearranged  to  form 
Hadamard  or  natural  ordering.   The  periodic  sampling  of 
these  Hadamard-ordered  Walsh  functions  yields  Hadamard 
matrices,  which  can  be  generated  recursively  [Ref.  2:  p.  314, 
13:  p.  226]  as  follows: 
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[Hh(0)]  =  [1] 
[Hh(k+1)]  = 


[Hh(k)]    [Hh(k)] 
[Hh(k)]   -{Hh(k)] 


k  =  0,l log2N 


(3.30) 
The  Hadamard  ordered  transform  is  similar  to  (3.20),  and 
is  defined  as : 


X(k)  =  (l/N)[H,(G)]x(n) 
<^»  n     ~ 


(3.31a) 


where  G=log?N.   The  inverse  (WHT)h  is  defined  as: 

x(n)  =  [H  (G)]X(k) 
<>»        n    /-w 


(3.31b) 


The  Hadamard  ordered  transform  is  also  referred  to  as  the 
binary  Fourier  representation  (BIFORE)  transform.   Fast 
algorithms  can  also  be  developed  for  the  (WHT)h  based  on 
matrix  factorizations.   The  Hadamard  matrix  for  an  8-point 
transform  can  be  formed  by  3  successive  applications  of 
(3.30)  : 


[Hh(3)]  = 


0 
1 

2 
3 
4 
5 

6 
7 


0 


5   6 


Sequency 
0 
4 
2 
2 
1 
3 
1 
3 


(3.32) 
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This  matrix  can  be  factored  into  3(log„N)  matrix  factors, 
each  representing  a  computation  stage  of  a  fast  WHTh 
algorithm.   One  particular  factorization,  using  normal 
input  and  producing  normal  output,  is  shown  in  Figure  3.15. 
4 .   Multi-processor  Hadamard  Ordered  Transform  (MPWHT)w 
Another  method  of  factoring  (3.32)  is  shown  in 
Figure  3.16.   Again,  this  factorization  is  not  unique, 
but  allows  for  efficient  implementation  of  (3.32).   The 
corresponding  single-processor  signal  flowgraph  is  shown  in 
Figure  3.17.   This  factorization  accepts  bit-reversed  input, 
and  generates  bit-reversed  output.   This  method  is 
compatible  with  a  multi-processor  implementation,  in 
that  points  participating  in  the  first  butterfly  stage 
are  adjacent  in  the  input  sequence.  It  is  easy  to  verify 
that  W   from  Figure  3.16  is  equivalent  to  (3.32),  with 
rows  and  columns  reordered  to  reflect  the  changes  in 
output  and  input  ordering,  respectively. 

The  algorithm  of  (3.34)  and  Figure  3.17  can  be 
partitioned  for  implementation  on  the  multi-processor 
Butterfly  Network.   For  an  8-point  WHTh  implemented  using 
4  processors,  the  vector-matrix  equation  is: 

X(k)  =  W,  T0  W9  T-  W,  x(n)  (3.35) 

where  the  input  sequence,  x(n),  is  in  BR  ordering.   The 
output  is  permuted  into  PBR  order  by  the  necessary 
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Figure  3.15   8-Point /l -Processor  Fast  WHTh  (Normal  Input) 
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Figure    3.16       8-Point /1-Processor    Fast    WHTh    (BR    Input) 
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inter-processor  transfers.   The  matrix  product  of  (3.35) 
is  carried  out  in  Figure  3.18.   Note  that  the  transfer 
matrices  T-,  and  T0  are  again  identical  to  the  8-point/ 
4-processor  DIT  MPFFT  with  BR  input.   The  product  ^W   is 
equivalent  to  the  W   matrix  of  the  WHTh  single-processor 
implementation  with  rows  reordered,  reflecting  the 
reordering  of  the  output. 

An  8-point  WHTh  implemented  on  2  processors  can 
be  expressed  as: 

X(k)  =  W„  T.  W0  w\  x(n)  (3.36) 

where  T,  is  the  same  as  T,  of  the  Walsh-ordered  MPWHT , 
*~  l  ~i 

again  identical  to  T,  of  the  8-point /2-processor  DIT 
MPFFT  with  BR  input.   In  this  case  also,  the  output  is  in 
PBR  order,  as  would  be  expected.   Thus,  the  algorithms 
from  Section  II  for  transfer  matrix  generation  and  output 
reordering  may  be  used  in  this  case  also. 

Another  alternative  factorization  of  the 
Hadamard  transform  is  to  use  pseudo-normal  input  data. 
This  option  is  desirable,  since  the  output  is  produced 
in  normal  order,  as  the  Walsh  ordered  transform  produces 
normal  output  when  the  input  sequence  is  in  PBR  order. 
The  8-point/4-processor  MPWHTw  using  PN  input  is 
expressed  as : 

X(k)  =  W~  T  '  W0  T, f  W,  x(n)       -  (3.37) 
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Figure  3.18   8-Point /4-Processor  Fast  WHTh  (BR  Input) 
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and  the  product  is  carried  out  in  Figure  3.19.   Similar 
to  the  Walsh  ordered  transform  when  PBR  input  is  used, 
the  transfers  of  (3.37)  occur  in  reverse  order  from 
the  BR  input  case  of  (3.35).   That  is,  T  '  of  (3.37)  is 
identical  to  T0  of  (3.35),  as  areT0'  and  T, .   When  fewer 
processors  are  used,  the  transfer  matrices  are  also  the 
same,  and  also  occur  in  reverse  order.   This  has  the 
effect  of  reversing  the  relationship  between  pre-  and 
post-transfer  butterfly  stages.   Now  there  are  log?P 
butterfly  and  transfer  stages,  followed  by  log^M 
post-transfer  butterfly  stages  with  no  transfers  between 
the  post-transfer  butterflies. 

In  this  section  all  normalizing  factors  of  (1/N) 

r 

for  the  forward  transforms  have  been  omitted  to  allow 
concentration  on  the  transfer  patterns  and  reordering 
of  output  sequences.   They  are  easily  included  at  the 
end,  if  necessary. 

The  ideas  presented  in  this  section  are  easily 
extended  to  larger  data  lengths  and  more  processors  as 
follows.   For  input  record  length  N,  there  are  always 
log„N  total  butterfly  stages.   For  P  active 
processors,  there  are  log„P  butterfly  stages  with  inter- 
processor  transfer  stages  between  each.   These  occur  after 
L'K'  =  log9M  pre-transfer  butterfly  stages  (BR  input),  or 
prior  to  L*  butterfly  stages  (PN  input).   The  transfer 
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Figure  3.19   8-Point /4-Processor  Fast  WHTh  (PN  Input) 
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patterns  and  output  reordering  for  the  Walsh  transform 
(Walsh  or  Hadaraard  order)  parallel  those  of  the  multi- 
processor FFT.   Thus,  algorithms  for  FFT  transfer 
matrix  generation  are  applicable  to  the  corresponding 
WHT  transfer  patterns.   Reordering  of  output  sequences 
from  the  single-processor  WHT  are  likewise  obtainable 
from  the  corresponding  MPFT  algorithm. 

Finally,  the  Walsh  transform  is  very  compatible 
with  the  multi-processor  Butterfly  Network.   Both  Walsh 
and  Hadamard  orderings  have  many  of  the  same  properties  as 
the  MPFFT  when  implemented  on  the  BN .   Thus,  ideas  from 
Section  II  are  also  applicable  to  the  WHT. 
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IV.   COMBINED  ALGORITHM  FOR  TRANSFER  MATRIX 
GENERATION  AND  MULTI-PROCESSOR  OUTPUT 
ORDERING 

The  algorithms  presented  in  Section  II  for  generating 
transfer  matrices  and  output  reordering  can  be  combined  into 
a  single  algorithm.   To  do  this,  a  distinction  must  be 
drawn  between  column  labels  and  output  labels.   Recall 
the  algorithm  of  Section  II. D. 2  for  generating  transfer 
matrices  as  permutations  of  identity  matrices.   In  an  identity- 
matrix,  lfs  occur  only  in  those  positions  where  the  row  and 
column  indices  are  the  same.   For  the  chosen  transfer 
patterns  on  the  Butterfly  Network,  processors  transfer 
one-half  of  their  data  at  any  given  transfer  stage.   Thus, 
half  the  elements  move  off  the  main  diagonal,  and  half 
remain.   The  algorithm  for  generating  transfer  matrices 
gives  the  column  indices  where  l's  appear,  as  a  permutation 
of  the  binary  representation  of  the  (normal-ordered)  row 
indices.   For  half  the  row  indices,  this  involves  switching 
a  1  and  a  0,  and  the  corresponding  element  moves  off  the 
main  diagonal  (representing  transfer  of  a  data  point).   For 
the  other  half  of  the  row  indices,  the  algorithm  switches  a 
1  and  a  1 ,  or  a  0  and  a  0,  and  the  corresponding  elements' 
row  and  column  indices  remain  equal.   These  elements  remain 
on  the  main  diagonal,  and  represent  data  points  not 
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transferring  at  that  stage.   The  transfer  raarices  are 
independent,  in  the  sense  that  each  successive  transfer 
matrix  is  generated  by  permuting  the  normal-ordered  row 
indices  to  obtain  the  columns  where  1's  appear  at  that 
stage . 

The  effect  of  the  transfers  on  the  output  ordering  is 
cumulative,   however.   Recall  that  the  multi-processor  fast 
transform  output  can  be  viewed  as  a  reordering  of  the 
corresponding  single-processor  output.   For  example,  a 
single-processor  DIF  FFT  with  bit-reversed  input  produces 
normal-ordered  output.   Section  II. C  showed  that  the  DIT 
multi-processor  FFT  output  is  produced  in  pseudo-normal 
order.   Recall  also  that  pseudo-normal  order  is  a  function 
of  the  number  of  active  processors,  P,  hence  the  number  of 
transfer  stages.   That  is,  the  pseudo-normal  ordering  for 
a  16-point /8-processor  DIT  FFT  is  different  from  that  of 
a  16-point /4-processor  DIT  FFT.   In  the  first  case, 
3(log98)  transfers  are  necessary,  while  the  4-processor 
case  only  requires  2  transfer  stages.   Thus,  the  pseudo- 
normal  ordering  of  the  8-processor  case  is  permuted 
"farther"  in  some  sense  from  the  single-processor 
normal-ordered  output. 

With  the  above  background  in  mind,  the  algorithm  for 
generating  transfer  matrices  is  extended  to  also  give 
output  reordering  as  follows.   A  distinction  is  drawn 
between  column  indices  and  output  ordering  indices. 
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These  are  both  represented  as  binary  numbers,  and  then 
permuted  according  to  the  algorithm  of  Section  .  II . D . 2 . 
Since  the  column  indices  are  generated  from  the  normal- 
ordered  row  indices  of  each  transfer  stage,  normal-order 
is  the  starting  point  for  generating  each  transfer  matrix. 
The  output  ordering  begins  in  the  appropriate  single- 
processor  output  ordering  for  the  fast  transform  being 
computed,  then  is  cumulatively  permuted  by  successive 
transfers . 

Several  examples  to  illustrate  the  combined  algorithm 
are  shown  in  Table  2.   This  table  illustrates  the  transfer 
matrix/output  reordering  of  the  DIT  multi-processor  FFT. 
Recall  that  the  single-processor  DIT  FFT  with  bit-reversed 
input  produces  normal-ordered  output.   Likewise,  for 
normal-ordered  input,  the  single-processor  DIT  FFT  generates 
output  in  bit-reversed  order.   These  orderings  are  the 
starting  points  for  the  output  reorderings  shown  in 
Table  2.   Note  that  the  bit-switching  algorithm  is  applied 
literally  to  the  output  indices'  bit  labels,  regardless  of 
what  position  the  bit  labels  occupy  at  a  given  stage.   Also 
note  that  due  to  the  cumulative  effect  of  transfers  oh  the 
output  ordering,  the  first  transfer  of  any  example  is  the 
only  time  the  column  indices  and  the  output  indices  are  the 
same.   It  is  also  obvious  from  Table  2  that  pseudo-normal 
and  pseudo-bit-reversed  ordering  depend   on  the  number  of 
transfers  necessary.   The  first  transfer  stage  of  the 
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TABLE  2:  COMBINED  ALGORITHM  FOR  GENERATING  TRANSFER  MATRICES 
AND  OUTPUT  ORDERING  (DIT  MPFFT) 


16  Points/8  Processors  : 


Transfer   Matr 

ix 

Row    Indices 

Column    Ie 

.dices 

Output 

Ordering 

b    b   b   b 
3    2    10 

Normal : 

b3b2blb0 

BR:    b0b1b2b3 

Tl    (b0   "    bi) 

b3b2b0bl 

Tl  (b0   " 

•   bi)b3b2b0b1 

T3'(b0  -    b1)b1b0b2b3 

T2    (b0  "   b2) 

b3b0b1b2 

T2(b0  - 

•   b2)b3bob2b1 

T2(b0  -    b2)  b1b2b0b3 

T3    (bo  -   b3) 

bob2blb3 

T3(b0  - 

b3)  b0b3b2b1 
PN:    b0b3b2b1 

Tl'(bo  -   b3)   bib2b3b0 
PBR:    blb2b3b0 

16   Points/4   Processors  : 

Transfer   Matr 

ix 

Row   Indices 

Column    In 

dices 

Output 

Ordering 

b3b2blb0 

Normal 

:    b3b2b1b0 

BR:    b0b1b2b3 

Ti     (bi    -   bs) 

b3blb2b0 

Tx(bi   - 

-   b2)  b3b1b2b0 

T2  (bx   -    b2)    b0b2b1b3 

T2    (bi    -   b3) 

bib2b3bo 

T2(bi   - 

-   b3)  b1b3b2b0 
PN:    b^b^o 

Ti(bi    -    b3)     b0b2b3b1 
PBR:    b0b2b3b1 

16    Points/2    Processors : 

Transfer   Matrix 

Row    Indices 

Column    Indices 

Output 

Ordering 

b3b2blbo 

Normal 

:    b3b2b1b0 

BR:    b0b1b2b3 

Tl    (b2    "   b3) 

b2b3blb0 

Tl  (b2  - 

-   b3)  b2b3b1b0 
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Tl'(b2    -    b3)     bQb1b3b2 
PBR:    bob-jb3b2 

8    Points/4   Processors  : 

Transfer   Matrix 

■ 

Row   Indices 

Column    Indices 

OutDUt 

Ordering 

b2b1b0 

Normal : 

b2blb0 

BR:    bQbib2 

Tl     (b0   -    bx) 

b2b0bl 

Tl     (b0 

-   b-j)  b2b0b1 

T2  (b0    "   bl)     b1b0b2 
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b0blb2 

T2   (b0 

-    b2)  bQb2b1 
PN:    bQb2b1 

Ti'Oo    -    b2)     V2b0 
PBR:    bib2bo 

8   Points/2    Processors  : 

Transfer   Matrix 

Row   Indices 

Column    It 

idices 

Output 

Ordering 

bob,brt 
2    10 

Normal : 

b2b1b0 

BR:    bQb1b2 

Tl    (bi    -   b2) 

b1b2bQ 

Tl    (bi 

"   b2>    bib2b0 
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16-point /8-processor  example  is  not  required  when  using  only 
4  or  2  processors,  and  is  effectively  bypassed  in  the  latter 

two  cases.   It  can  also  be  verified,  by  comparing  Table  2 
with  Table  1,  that  the  output  ordering  produced   is  the  same 
when  using  the  combined  algorithm. 

Care  must  be  taken  in  applying  this  algorithm,  depending 
on  the  type  of  fast  transform  being  computed.   The  algorithm 
from  Section  II. D . 2  and  generalized  here,  was  developed 
for  DIT  FFT  transfer  patterns.   It  can  be  used,  with  slight 
modifications  to  the  sequence  of  bit-switching,  for  all 
other  fast  transforms  presented  herein.   The  modifications 
are  discussed  in  the  sections  presenting  the  various  other 
fast  transforms. 
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V.   CONCLUSIONS 

Matrix  factorization  is  a  useful  method  of  representing 
distributed  fast  transforms.   While  not  suggested  as  a 
method  of  actually  computing  the  transforms ,  this 
representation  provides  insight  into  which  points  combine 
during  computation  stages,  and  into  the  transfer  patterns  of 
the  Butterfly  Network.   Additional  insight  is  gained  in 
terms  of  the  output  reordering  caused  by  the  inter-processor 
data  transfers . 

Matrix  partitioning  and  factorization  have  been  widely 
used  to  describe  single-processor  fast  transform 
algorithms.   This  method  was  shown  to  also  be  useful  in 
describing  distributed  fast  transforms.   Additional  matrix 
factors  are  required  to  represent  the  necessary  inter- 
processor  transfer  stages.   These  "transfer  matrices"  may 
be  generated  by  permutations  of  identity  matrices  for 
selected  transfer  patterns  on  the  Butterfly  Network.   The 
transfer  structure  was  shown  to  determine  the  ordering  of 
the  transformed  output  sequence.   The  output  ordering  can 
be  described  in  terms  of  a  permutaton  of  the  single-processor 
output  produced  by  a  given  fast  transform  algorithm. 

The  Butterfly  Network  also  supports  the  Fast  Hartley 
Transform,  and  the  Walsh-Hadamard  Transform.   Transfer 
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patterns  similar  to  those  required  for  the  multi-processor 
FFT  can  also  be  used  for  these  other  fast  transforms.   The 
output  orderings  produced  are  unique,  but  easily  decipherable. 

Many  equivalent  signal  flowgraphs  can  be  constructed  to 
represent  any  given  fast  transform  algorithm.   This 
flexibility  also  permits  many  equivalent  matrix  factorizations 
to  be  developed.   Constraints  include  the  need  to  minimize 
additional  memory  requirements  for  buffering,  or  computing 
"in-place"  as  much  as  possible,  and  the  desire  to  produce 
an  orderly  and  decipherable  output  sequence.   Additionally, 
for  distributed  algorithms,  an  orderly  pattern  of  inter- 
processor  data  transfers  is  also  desirable.   The  Butterfly 
Network  was  shown  to  meet  these  constraints  for  the 
computation  of  distributed  FFTs ,  and  to  conveniently 
support  several  other  fast  transform  algorithms. 

Further  work  would  be  desirable  on  developing  a  general 
method  for  finding  the  matrix  factors  of  the  distributed  form 
of  a  given  fast  transform  algorithm.   An  optimal  implementation 
of  the  FHT  (possibly  the  one  given  in  Section  III. A. 2)  on 
the  Butterfly  Network  is  also  worthy  of  further  research, 
as  the  FHT  appears  -to  have  a  quite  promising  future. 
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